WO 97/32999 PCT/US96/18708

SIMULTANEOUS SEQUENCING OF 
TAGGED POLYNUCLEOTIDES 

Field of the Invention

The invention relates generally to methods for sequencing polynucleotides, 
and more particularly, to a method of sorting and sequencing many polynucleotides 
simultaneously.

BACKGROUND

The desire to decode the human genome and to understand the genetic basis of
disease and a host of other physiological states associated with differential gene
expression has been a key driving force in the development of improved methods for
analyzing and sequencing DNA, Adams et al, Editors, Automated DNA Sequencing and
Analysis (Academic Press, New York, 1994). Current genome sequencing projects
use Sanger-based sequencing technologies, which enable the sequencing and
assembly of a genome of 1.8 million bases with about 24 man-months of effort, e.g.
Fleischmann et al, Science, 269: 496-512 (1995). Such a genome is about .005 the
size of the human genome, which is estimated to contain about 10^5 genes, 15% of
which--or about 3 megabases--are active in any given tissue. The large numbers of
expressed genes make it difficult to track changes in expression patterns by sequence
analysis. More commonly, expression patterns are initially analyzed by lower
resolution techniques, such as differential display, indexing, subtraction hybridization,
or one of the numerous DNA fingerprinting techniques, e.g. Liang et al, Science, 257:
967-971 (1992); Erlander et al, International patent application PCT/US94/13041;
McClelland et al, U.S. patent 5,437,975; Unrau et al, Gene, 145: 163-169 (1994); and
the like. Sequence analysis is then frequently carried out on subsets of cDNA clones
identified by application of such techniques, e.g. Linskens et al, Nucleic Acids
Research, 23: 3244-3251 (1995). Such subsequent analysis is invariably carried out
using conventional Sanger sequencing of randomly selected clones from a subset;
thus, the scale of the analysis is limited by the Sanger sequencing technique.

Recently, two techniques have been reported that attempt to provide direct
sequence information for analyzing patterns of gene expression, Schena et al, Science,
270: 467-469 (1995) (hybridizing mRNA to 45 expressed sequence tags attached to a
glass slide) and Velculescu et al, Science, 270: 484-486 (1995) (excision and
concatenation of short tags adjacent to type IIs restriction sites in sequences from a
cDNA library, followed by Sanger sequencing of the concatenated tags). However,
implementation of these techniques has only involved relatively few sequences (45 and
30, respectively) so it is not clear whether they have the capability to track a more
meaningful sample of expressed genes, e.g. Kollner et al, Genomics, 23: 185-191
(1994). Without substantially larger sample sizes, the techniques will not be able to
track changes in the transcript levels of low-expression genes.
It is clear from the above that there is a crucial need for higher throughput
sequencing techniques that can both reduce the time and effort required to analyze
genome-sized DNAs and be applied to the analysis of large samples of sequences
from complex mixtures of polynucleotides, such as cDNA libraries. Such techniques
would find immediate application in medical and scientific research, drug discovery,
diagnosis, forensic analysis, food science, genetic identification, veterinary science,
and a host of other fields.

Summary of the Invention 
An object of my invention is to provide a new method and approach for
determining the sequence of polynucleotides.

Another object of my invention is to provide a method for rapidly analyzing 
patterns of gene expression in normal and diseased tissues and cells. 

A further object of my invention is to provide a method, kits, and apparatus for
simultaneously analyzing and/or sequencing a population of many thousands of
different polynucleotides, such as a sample of polynucleotides from a cDNA library or
a sample of fragments from a segment of genomic DNA.

Still another object of my invention is to provide a method, kits, and apparatus 
for identifying populations of polynucleotides. 

Another object of my invention is to provide a method for sequencing
segments of DNA in a size range corresponding to typical cosmid or YAC inserts.

My invention achieves these and other objectives by providing each
polynucleotide of a population with an oligonucleotide tag for transferring sequence
information to a tag complement on a spatially addressable array of such
complements. That is, a unique tag is attached to each polynucleotide of a population
which can be cleaved or copied and used to shuttle sequence information to its
complement at a fixed position on an array of such complements. After a tag
specifically hybridizes with its complement, a signal is generated that is indicative of
the transferred sequence information. Preferably, the invention is carried out by
repeated cycles of amplifying the polynucleotides, identifying one or more nucleotides
at an end of each polynucleotide by use of the oligonucleotide tags, and shortening the
polynucleotides by removing one or more nucleotides.

At least two major advantages are gained by using tags to shuttle information
to discrete spatial locations rather than sorting an entire population of polynucleotides
to such locations: First, tags are much smaller molecular entities so that the kinetics
of diffusion and hybridization are much more favorable. Second, tag loading at the
spatially discrete locations need only be sufficient for detection, while polynucleotide
loading would need to be sufficient for both biochemical processing and detection;
thus, far less tag needs to be loaded on the spatially discrete sites.

An important aspect of the invention is the attachment of an oligonucleotide
tag to each polynucleotide of a population such that substantially all different
polynucleotides have different tags. As explained more fully below, this is achieved
by taking a sample of a full ensemble of tag-polynucleotide conjugates wherein each
tag has an equal probability of being attached to any polynucleotide. The sampling
step ensures that the tag-polynucleotide conjugate population will fulfill the
above-stated condition that the tag of any polynucleotide of such population be
substantially unique.
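
The effect of the sampling step can be illustrated with a short calculation. The sketch below assumes, as a simplification that is not stated in the text, that every sampled conjugate carries a tag drawn independently and uniformly from the repertoire. The function name and the example repertoire size are hypothetical and chosen only for illustration.

```python
def fraction_with_unique_tag(repertoire_size: int, sample_size: int) -> float:
    """Expected fraction of sampled conjugates whose tag is shared with no
    other conjugate in the sample (independent, uniform tags assumed)."""
    if repertoire_size < 1 or sample_size < 1:
        raise ValueError("repertoire_size and sample_size must be positive")
    # A given conjugate keeps a unique tag only if each of the other
    # (sample_size - 1) conjugates received one of the remaining tags.
    return (1.0 - 1.0 / repertoire_size) ** (sample_size - 1)


if __name__ == "__main__":
    repertoire = 8 ** 8  # hypothetical repertoire size, for illustration only
    for fraction in (0.001, 0.01, 0.1):
        n = int(repertoire * fraction)
        print(f"sample = {fraction:.1%} of repertoire -> "
              f"{fraction_with_unique_tag(repertoire, n):.3f} unique")
```

Under these assumptions, the smaller the sample relative to the repertoire, the closer the fraction of uniquely tagged polynucleotides comes to one.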

Complements of the oligonucleotide tags are preferably synthesized on the
surface of a solid phase support, such as a microscopic bead or a specific location on
an array of synthesis locations on a single support, such that populations of identical
sequences are produced in specific regions. That is, the surface of each support, in
the case of a bead, or of each region, in the case of an array, is derivatized by only one
type of tag complement which has a particular sequence. The population of such
beads or regions contains a repertoire of complements with distinct sequences. As
used herein in reference to oligonucleotide tags and tag complements, the term
"repertoire" means the set of minimally cross-hybridizing oligonucleotides that
make up the tags in a particular embodiment or the corresponding set of tag
complements.

My invention provides a readily automated system for obtaining sequence
information from large numbers of polynucleotides at the same time. My invention is
particularly useful in operations requiring the generation of massive amounts of
sequence information, such as large-scale sequencing of genomic DNA fragments,
mRNA and/or cDNA fingerprinting, and highly resolved measurements of gene
expression patterns.

Brief Description of the Drawings
Figure 1 is a flow chart illustrating a general algorithm for generating
minimally cross-hybridizing sets.

Figure 2a illustrates the major steps of the preferred embodiment of the
method of the invention.

Figure 2b illustrates the use of rolling primers in the method of the
invention.

Figure 2c illustrates the manner in which rolling primers are selected in
successive cycles of amplification, tag transfer, and template mutation.

Figure 2d illustrates the use of S and T primers in one embodiment of
the invention.

Figure 3 diagrammatically illustrates an apparatus for detecting labeled
tags on a spatially addressable array of tag complements.

Definitions

"Complement" or "tag complement" as used herein in reference to
oligonucleotide tags refers to an oligonucleotide to which an oligonucleotide tag
specifically hybridizes to form a perfectly matched duplex or triplex. In embodiments
where specific hybridization results in a triplex, the oligonucleotide tag may be
selected to be either double stranded or single stranded. Thus, where triplexes are
formed, the term "complement" is meant to encompass either a double stranded
complement of a single stranded oligonucleotide tag or a single stranded complement
of a double stranded oligonucleotide tag.

The term "oligonucleotide" as used herein includes linear oligomers of natural
or modified monomers or linkages, including deoxyribonucleosides, ribonucleosides,
α-anomeric forms thereof, peptide nucleic acids (PNAs), and the like, capable of
specifically binding to a polynucleotide by way of a regular pattern of
monomer-to-monomer interactions, such as Watson-Crick type of base pairing, base
stacking, Hoogsteen or reverse Hoogsteen types of base pairing, or the like. Usually
monomers are linked by phosphodiester bonds or analogs thereof to form
oligonucleotides ranging in size from a few monomeric units, e.g. 3-4, to several tens
of monomeric units. Whenever an oligonucleotide is represented by a sequence of
letters, such as "ATGCCTG," it will be understood that the nucleotides are in 5'->3'
order from left to right and that "A" denotes deoxyadenosine, "C" denotes
deoxycytidine, "G" denotes deoxyguanosine, and "T" denotes thymidine, unless
otherwise noted. Analogs of phosphodiester linkages include phosphorothioate,
phosphorodithioate, phosphoranilidate, phosphoramidate, and the like. It is clear to
those skilled in the art when oligonucleotides having natural or non-natural
nucleotides may be employed, e.g. where processing by enzymes is called for, usually
oligonucleotides consisting of natural nucleotides are required.

"Perfectly matched" in reference to a duplex means that the poly- or
oligonucleotide strands making up the duplex form a double stranded structure with
one another such that every nucleotide in each strand undergoes Watson-Crick
basepairing with a nucleotide in the other strand. The term also comprehends the
pairing of nucleoside analogs, such as deoxyinosine, nucleosides with 2-aminopurine
bases, and the like, that may be employed. In reference to a triplex, the term means
that the triplex consists of a perfectly matched duplex and a third strand in which
every nucleotide undergoes Hoogsteen or reverse Hoogsteen association with a
basepair of the perfectly matched duplex. Conversely, a "mismatch" in a duplex
between a tag and an oligonucleotide means that a pair or triplet of nucleotides in the
duplex or triplex fails to undergo Watson-Crick and/or Hoogsteen and/or reverse
Hoogsteen bonding.

As used herein, "nucleoside" includes the natural nucleosides, including
2'-deoxy and 2'-hydroxyl forms, e.g. as described in Kornberg and Baker, DNA
Replication, 2nd Ed. (Freeman, San Francisco, 1992). "Analogs" in reference to
nucleosides includes synthetic nucleosides having modified base moieties and/or
modified sugar moieties, e.g. described by Scheit, Nucleotide Analogs (John Wiley,
New York, 1980); Uhlman and Peyman, Chemical Reviews, 90: 543-584 (1990), or
the like, with the only proviso that they are capable of specific hybridization. Such
analogs include synthetic nucleosides designed to enhance binding properties, reduce
complexity of probes, increase specificity, and the like.

As used herein, "amplicon" means the product of an amplification reaction.
That is, it is a population of identical polynucleotides, usually double stranded, that
are replicated from a few starting sequences. Preferably, amplicons are produced in a
polymerase chain reaction (PCR).

As used herein, "complexity-reducing nucleotide" refers to a natural or
non-natural nucleotide (i) that, when paired with any of more than one natural
nucleotide, can form a duplex of substantially equivalent stability to that of the same
duplex containing the cognate natural nucleotide, i.e. the natural nucleotide it replaces,
and (ii) that can be processed by enzymes in substantially the same way as its cognate
natural nucleotide. Preferably, complexity-reducing nucleotides do not display
degeneracy or ambiguity when processed by DNA polymerases. That is, when a
complexity-reducing nucleotide is in a template that is being copied by a polymerase,
the polymerase incorporates a unique nucleotide at the site of a complexity-reducing
nucleotide. Likewise, when a complexity-reducing nucleotide triphosphate is a
substrate for a DNA polymerase, it is incorporated only at the site of a single kind of
nucleotide, i.e. one or another of its complements, but not both. Candidate
complexity-reducing nucleotides are readily tested in straightforward hybridization
assays, e.g. with melting temperature comparisons, and in incorporation assays in
which test polymerizations are checked by conventional sequencing or by
incorporation of radio-labeled complexity-reducing nucleotides, e.g. Bessman et al,
Proc. Natl. Acad. Sci., 44: 633 (1958). Preferably, "substantially equivalent stability,"
as used herein means that the melting temperature of a test 13-mer duplex, as
described in Kawase et al, Nucleic Acids Research, 14: 7727-7736 (1986), is within
twenty percent of that of the same duplex containing a natural cognate nucleotide.

Detailed Description of the Invention 
The invention provides a method of sequencing large numbers of
polynucleotides in parallel by using oligonucleotide tags to shuttle sequence
information obtained in "bulk" or solution phase biochemical processes to discrete
spatially addressable sites on a solid phase. Signals generated at the spatially
addressable sites convey the sequence information carried by the oligonucleotide tag.
As explained more fully below, sequencing is preferably carried out by repeated
cycles of amplifying the polynucleotides, identifying nucleotides, and shortening the
polynucleotides. In the shortening cycles, a predetermined number of nucleotides
(usually the nucleotides identified in the previous cycle) are cleaved from the
polynucleotides and the shortened polynucleotides are employed in the next cycle of
amplification, identification, and shortening. Since the oligonucleotide tags
specifically hybridize to the same location on a spatially addressable array of tag
complements, the sequence of a particular polynucleotide can be read by observing
the signals generated from that location through successive cycles of the method.
More particularly, the invention is carried out by the following steps: (a)
attaching an oligonucleotide tag from a repertoire of tags to each polynucleotide of a
population to form tag-polynucleotide conjugates such that substantially all different
polynucleotides have different oligonucleotide tags attached; (b) providing a label for
each oligonucleotide tag for identifying one or more terminal nucleotides of its
associated polynucleotide; (c) transferring the labeled oligonucleotide tags or copies
thereof onto a spatially addressable array of tag complements for sorting and specific
hybridization of the oligonucleotide tags or copies thereof and detection of the labels.
Preferably, the method further includes the steps of (d) shortening the polynucleotides
so that further sequence information can be obtained and (e) repeating steps (b)-(d)
until sufficient sequence information is accumulated for identification of the
polynucleotides. Preferably, the step of transferring the oligonucleotide tags to the
array of tag complements includes separating the oligonucleotide tags from the
amplification reaction mixture, cleaving the oligonucleotide tags from their
tag-polynucleotide conjugates, and applying the oligonucleotide tags to the array of tag
complements.

Preferably, the step of providing a label for the oligonucleotide tags is
accomplished by selectively amplifying polynucleotides in a polymerase chain
reaction. As illustrated in Figure 2a, in the preferred embodiment, a population of
tagged polynucleotides (200) having flanking primer binding sites (12 & 22) is
aliquotted into four reaction vessels (28)-(34) containing common, or "T," primers
(210), and "S," or second, primers which have defined 3' terminal nucleotides of A,
C, G, and T, respectively (202)-(208). Preferably, a T primer carries a fluorescent
label indicative of the defined 3' terminal nucleotide of the S primer in its reaction
vessel. After amplification (207), an aliquot of the amplicons is taken (209) from
each vessel, pooled, and the labeled oligonucleotide tags are prepared for transferring
to a spatially addressable array (48). Separately, aliquots are also taken from each
vessel, pooled, and shortened (213) by mutating (if necessary) or cleaving the
polynucleotide to remove the nucleotide(s) just identified. The shortened
polynucleotides are then re-aliquotted (214) to four reaction vessels for the next cycle
of selective amplification, transfer of labeled tags, and shortening. In successive
cycles, tag complement locations (216) receiving labeled oligonucleotide tags will
successively generate signals indicative of the defined 3' terminal nucleotides of the S
primers.

In one embodiment, the identity of the one or more terminal nucleotides is
determined by selectively amplifying correct sequence primers in a polymerase chain
reaction (PCR) employing primers whose 3' terminal sequences are complementary to
every possible sequence of the one or more terminal nucleotides whose identity is
sought. Thus, when the identity of a single terminal nucleotide is sought, four
separate polymerase chain reactions may be carried out with one primer identical in
each of the four reactions, but with each of the other four primers having a 3' terminal
nucleotide that is either A, C, G, or T. As used herein, this terminal nucleotide is
referred to as a "defined 3' terminal nucleotide." The defined 3' terminal nucleotide is
positioned so that it must be complementary to the terminal nucleotide of the
polynucleotide for amplification to occur. Thus, the identity of the primer in a
successful amplification gives the identity of the terminal nucleotide of the target
sequence. This information is then extracted in parallel from the population of
polynucleotides by detaching or copying the amplified tags and sorting them onto
their tag complements on a spatially addressable array. After amplification and
identification, the polynucleotides are shortened by removing one or more terminal
nucleotides, for example, by cleavage with a type IIs restriction endonuclease. By
repeating this process for successive nucleotides the sequences of a population of
polynucleotides are determined in parallel.
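
The cycle of selective amplification, tag sorting, and shortening described above can be sketched as a toy simulation. The sketch makes several simplifying assumptions that are not part of the disclosure: amplification is treated as all-or-nothing, one nucleotide is removed per cycle, and each tag is represented by a dictionary key standing in for its array address. The function name `sequence_in_parallel` is hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def sequence_in_parallel(tagged: dict[str, str], cycles: int) -> dict[str, str]:
    """Read up to `cycles` nucleotides from every tagged strand in parallel.

    `tagged` maps a tag (standing in for an array address) to the strand being
    read, written starting from the end that is identified first.
    """
    reads = {tag: "" for tag in tagged}
    remaining = dict(tagged)
    for _ in range(cycles):
        # Four vessels, one per defined 3' terminal nucleotide of the S primer.
        for primer_end in "ACGT":
            for tag, strand in remaining.items():
                # Amplification, and hence labeling of the tag, occurs only when
                # the primer end is complementary to the terminal nucleotide.
                if strand and COMPLEMENT[primer_end] == strand[0]:
                    # The label seen at this tag's address names the primer end,
                    # so the template nucleotide is its complement.
                    reads[tag] += COMPLEMENT[primer_end]
        # Shortening step: remove the nucleotide just identified.
        remaining = {tag: strand[1:] for tag, strand in remaining.items()}
    return reads
```

Each address accumulates one nucleotide per cycle, which mirrors how a sequence is read by following the signals at a single tag complement location over successive cycles.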

In another embodiment, the step of shortening is accomplished by advancing a
primer along the polynucleotide templates by template mutation. An important
feature of this embodiment is providing a set of primers, referred to herein as "rolling
primers," that contain complexity-reducing nucleotides for reducing the number of
primers required for annealing to every possible primer binding site on templates
formed from the polynucleotides. Another important feature of this embodiment is
the systematic replacement of at least one of the four nucleotides in the polynucleotide
with its cognate complexity-reducing nucleotide or complement thereof. Sequencing
is initiated by annealing rolling primers differing only in their terminal nucleotides to
a primer binding site of the polynucleotide templates so that only the rolling primer
whose terminal nucleotide forms a perfect complement with the template leads to the
formation of an extension product. After amplifying the double stranded extension
product to form an amplicon, the terminal nucleotide, and hence its complement in the
template, is identified by the identity of the amplicon. For example, in a form of this
embodiment, a terminal nucleotide may be identified by the presence or absence of
amplicon in four vessels that are used for separate extension and amplification
reactions. The primer binding site of the template of the successfully amplified
polynucleotide is then mutated by, for example, oligonucleotide-directed mutagenesis
so that a subsequent rolling primer may be selected from the set that forms a perfectly
matched duplex with the mutated template at a site which is shifted towards the
direction of extension by one nucleotide relative to the binding site of the previous
rolling primer. The steps of selective extension, amplification and identification are
then repeated. In this manner, the primers "roll" along the polynucleotide during the
sequencing process, moving a base at a time along the template with each cycle.
Preferably, in the rolling primer embodiment the tag-polynucleotide conjugate is
shortened by a single nucleotide in each cycle of steps.

The preferred embodiments and common elements of the invention are 
described in more detail below. 

Rolling Primers 

Preferably, rolling primers are from 15 to 30 nucleotides in length and have the
following form:

X1X2 ... XkYY ... YN

where the Xj's are nucleotides, preferably arranged in repetitive subunits; Y's are
complexity-reducing nucleotides or their complements; and N is a terminal nucleotide
of either A, C, G, or T, or a complexity-reducing nucleotide, such as deoxyinosine.
The segments of Xj nucleotides, referred to herein as the "template positioning
segments," are preferably arranged in repetitive subunits so that the primer is properly
registered on the primer binding site with the terminal nucleotide juxtaposed with the
first nucleotide of the polynucleotide. Preferably, the repeat subunit is long enough so
that if the primer is out of register by one or more repeat subunits, it will be too
unstable to remain annealed to the template. Preferably, the repeat subunit is from 4
to 8 nucleotides in length. As will become more apparent below, arranging the
template positioning segment as a series of identical subunits reduces the overall
number of primers required in a set of rolling primers. Preferably, the template
positioning segments are selected from a group of no more than two nucleotides, at
least one of which is a complement of a complexity-reducing nucleotide being
employed. In preferred embodiments, the underlined Xk indicates the position at
which the template is mutated by way of oligonucleotide-directed mutagenesis, e.g. a
technique fully described in Current Protocols in Molecular Biology (John Wiley &
Sons, New York, 1995).

The segment YY ... YN is referred to herein as the "extension region" of the
primer, as the primer is extended from this end along the template. Preferably,
extension is carried out by a polymerase so that YY ... YN is in a 5'->3' orientation.
However, the orientation could be 3'->5' with other methods of extension, e.g. by
ligating oligonucleotide blocks as described in U.S. patent 5,114,839. An important
feature of the invention is that extension only takes place when the terminal nucleotide,
N, forms a Watson-Crick base pair with the adjacent nucleotide in the template. The
extension region comprises the minimal number of nucleotides greater than two that
can form a stable duplex with the template, even if there is a mismatch at the Xk
position. That is, in the preferred embodiments, the duplex between the extension
region and the template must be stable enough to carry out the
oligonucleotide-directed mutagenesis. Preferably, the extension region comprises from
3 to 6 nucleotides, and most preferably, it comprises 4 nucleotides. Preferably, Y is
selected from the group consisting of deoxyadenosine (A) and deoxyinosine (I).

The number of rolling primers required for a particular embodiment depends
on several factors, including the type of complexity-reducing nucleotides employed,
the length of the primer, the length of the extension region, and the repeat subunit
length of the template positioning segment. For example, the following set of primers
(SEQ ID NO: 1-6) has a template positioning segment 18 nucleotides in length made
up of subunits of G's and A's 6 nucleotides in length.



Subgroup    Rolling Primer Sequence

(1)         GGAAGAGGAAGAGGAAGAYYYN
(2)         GAAGAGGAAGAGGAAGAGYYYN
(3)         AAGAGGAAGAGGAAGAGGYYYN
(4)         AGAGGAAGAGGAAGAGGAYYYN
(5)         GAGGAAGAGGAAGAGGAAYYYN
(6)         AGGAAGAGGAAGAGGAAGYYYN

If Y is A or I and N is A, C, I, or T, then the above set of rolling primers includes 192
(= 6 x 2^3 x 4) primers. In particular, each "YYY" represents all of the following
sequences: AAA, AAI, AII, AIA, IAI, IAA, IIA, and III. As can be seen from the
above example, a template positioning segment is available for shifting the primer one
nucleotide in the direction of extension after any cycle. That is, if a primer from
subgroup (5) were employed in a cycle, the next primer employed would be selected
from subgroup (6); if a primer from subgroup (6) were employed in a cycle, the next
primer employed would be selected from subgroup (1), and so on. When PCR is used
to copy and amplify the template, the template is, in effect, shortened by one
nucleotide in each cycle.
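
The composition of this example set can be checked by enumeration. The following sketch rebuilds the six template positioning segments by shifting a repeated GGAAGA subunit one nucleotide at a time and expands every choice of Y and N. The function and constant names are hypothetical, and the code only restates the counting given above.

```python
from itertools import product

REPEAT_SUBUNIT = "GGAAGA"   # 6-nucleotide subunit of the example set
SEGMENT_LENGTH = 18         # length of the template positioning segment
Y_CHOICES = "AI"            # Y is A or I
N_CHOICES = "ACIT"          # N is A, C, I, or T


def positioning_segment(subgroup: int) -> str:
    """Template positioning segment for subgroup 1..6, each shifted by one."""
    period = len(REPEAT_SUBUNIT)
    shift = (subgroup - 1) % period
    repeated = REPEAT_SUBUNIT * (SEGMENT_LENGTH // period + 1)
    return repeated[shift:shift + SEGMENT_LENGTH]


def rolling_primers() -> dict[int, list[str]]:
    """All primers of the example set, grouped by subgroup number."""
    primers = {}
    for subgroup in range(1, len(REPEAT_SUBUNIT) + 1):
        segment = positioning_segment(subgroup)
        primers[subgroup] = [
            segment + "".join(ys) + n
            for ys in product(Y_CHOICES, repeat=3)
            for n in N_CHOICES
        ]
    return primers


def next_subgroup(subgroup: int) -> int:
    """Subgroup used in the following cycle: 1 -> 2 -> ... -> 6 -> 1."""
    return subgroup % len(REPEAT_SUBUNIT) + 1


if __name__ == "__main__":
    total = sum(len(group) for group in rolling_primers().values())
    print(total)  # 6 subgroups x 2**3 extension regions x 4 terminal nucleotides
```

Because the positioning segment is periodic with the subunit length, only six subgroups are needed no matter how many cycles are run.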

Alternatively, the binding strength of the extension region can be improved by
substituting G for I and diaminopurine (D) for A in all positions, except those
immediately adjacent to the terminal nucleotide. That is, an alternative set of "YYY"
sequences includes DDA, DDI, DGI, DGA, GDI, GDA, GGI, and GGA.
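
The substitution rule can be applied mechanically to the eight original "YYY" sequences. The sketch below assumes that "immediately adjacent to the terminal nucleotide" refers to the last Y position, an interpretation that reproduces the eight sequences listed above; the names `STRONGER` and `strengthen` are hypothetical.

```python
from itertools import product

# D (diaminopurine) replaces A, and G replaces I.
STRONGER = {"A": "D", "I": "G"}


def strengthen(yyy: str) -> str:
    """Substitute every Y except the one next to the terminal nucleotide N."""
    return "".join(STRONGER[y] for y in yyy[:-1]) + yyy[-1]


# Apply the rule to all eight combinations of A and I.
alternatives = sorted(strengthen("".join(ys)) for ys in product("AI", repeat=3))

if __name__ == "__main__":
    print(", ".join(alternatives))
```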


Sequencing with Rolling Primers 
Generally, this aspect of my invention is carried out with the following steps:
(a) providing a set of primers, i.e. the rolling primers, each primer of the set having an
extension region comprising one or more complexity-reducing nucleotides and a
terminal nucleotide; (b) forming a template comprising a primer binding site and the
polynucleotide whose sequence is to be determined, the primer binding site being
complementary to the extension region of at least one primer of the set; (c) annealing
a primer from the set to the primer binding site, the extension region of the primer
forming a perfectly matched duplex with the template and extending the primer to
form a double stranded DNA; (d) amplifying the double stranded DNA to form an
amplicon; (e) identifying the terminal nucleotide of the extension region of the primer
by the identity of the amplicon; (f) mutating the primer binding site of the template so
that the primer binding site is shifted one or more nucleotides in the direction of
extension, thereby effectively shortening the polynucleotide by one or more
nucleotides; and (g) repeating steps (c) through (f) until the nucleotide sequence of the
polynucleotide is determined.

Prior to sequencing, a polynucleotide is treated so that one or more kinds of
nucleotide are substituted with their cognate complexity-reducing nucleotides. In a
preferred embodiment, this is conveniently accomplished by replicating the
polynucleotide in a PCR wherein dGTP is replaced with dITP. A template for
sequencing is then prepared by joining the polynucleotide to a primer binding site.
Typically, this is accomplished by inserting the polynucleotide into a vector which
carries the primer binding site. Preferably, the primer binding site is in the 3' direction
relative to the polynucleotide so that primer extensions can be carried out with a DNA
polymerase. Such insertion is conveniently carried out using a blunt-end-cutting
restriction endonuclease, such as Stu I or Ecl 136 II, if the rolling primers described
above are employed. These enzymes leave a three-base sequence adjacent to the
beginning of the polynucleotide that is complementary to the primers described above.
Preferably, a primer, referred to herein as the "T" primer, is located at the other end of
the polynucleotide so that it can be amplified by PCR. For example, sequencing can
be initiated on such a template in four separate reactions as shown below, assuming
the use of the primers (SEQ ID NO: 1-6) described above.



Reaction 1      GGAAGAGGAAGAGGAAGAAIIA->
             ...CCTTCTCCTTCTCCTTCTTCCNNNN ... NNNBBBB ... BB

Reaction 2      GGAAGAGGAAGAGGAAGAAIIC->
             ...CCTTCTCCTTCTCCTTCTTCCNNNN ... NNNBBBB ... BB

Reaction 3      GGAAGAGGAAGAGGAAGAAIII->
             ...CCTTCTCCTTCTCCTTCTTCCNNNN ... NNNBBBB ... BB

Reaction 4      GGAAGAGGAAGAGGAAGAAIIT->
             ...CCTTCTCCTTCTCCTTCTTCCNNNN ... NNNBBBB ... BB

where "NNNN ... NNN" represents the polynucleotide and "BBBB ... BB" represents
the complement of a T primer binding site for amplifying the sequences by PCR. The
underlined sequences indicate the extension regions of the rolling primers. The
template positioning segment of the primers was arbitrarily chosen to correspond to a
primer from subgroup (1) described above. If it is assumed--to illustrate the method--
that the sequence of the polynucleotide adjacent to the rolling primer binding site is
"TAIC," then only Reaction 1 will result in the formation of an amplicon, and the first
nucleotide of the polynucleotide is identified as T. Preferably, prior to amplification,
the primer is extended with a high fidelity DNA polymerase, such as Sequenase, in
the presence of dATP, dCTP, dITP, and dTTP in the preferred embodiments. It
should be understood that selective extension may also be carried out in a single
vessel, for example, if labeled primers are employed and the extension products are
separated from the primers that fail to extend. The important feature is that only
primers whose terminal nucleotide forms a correct Watson-Crick basepair with the
template are extended. Preferably, after extension, any single stranded DNA in the
reaction mixture is digested with a single stranded nuclease, such as Mung bean
nuclease. After such extension and digestion, the remaining double stranded DNA is
then amplified, again in the presence of dATP, dCTP, dITP, and dTTP in the
preferred embodiments, to produce an amplicon. Preferably, this amplification is
accomplished by 5-10 cycles of PCR so that there is little or no likelihood of
anomalous amplification products being produced.

Samples of the amplicon from Reaction 1 are removed and aliquotted into four
new vessels containing the following primers from subgroup (2) (SEQ ID NO: 2):

Reaction 5       GAAGAGGAAGAGGAAGAGIIAA->
             ...CCTTCTCCTTCTCCTTCTTCCTNNN ... NNNBBBB ... BB

Reaction 6       GAAGAGGAAGAGGAAGAGIIAC->
             ...CCTTCTCCTTCTCCTTCTTCCTNNN ... NNNBBBB ... BB

Reaction 7       GAAGAGGAAGAGGAAGAGIIAI->
             ...CCTTCTCCTTCTCCTTCTTCCTNNN ... NNNBBBB ... BB

Reaction 8       GAAGAGGAAGAGGAAGAGIIAT->
             ...CCTTCTCCTTCTCCTTCTTCCTNNN ... NNNBBBB ... BB

Since the first nucleotide of the polynucleotide was determined in the previous cycle,
one selects primers from subgroup (2) whose extension regions have the form "IIAN,"
as shown. This creates a mismatch at the underlined T in the lower strands, which is
mutated to C in any amplicon produced by oligonucleotide-directed mutagenesis.
That is, the primer is the oligonucleotide directing the mutation of the site in the
amplicon. Thus, the "T" is converted into a "C" in the amplicons. Since the second
nucleotide of the target is A, both Reactions 7 and 8 lead to the production of
amplicons. Either amplicon may be sampled for the next cycle since only a single
polynucleotide is presently being considered. As explained more fully below, an
additional "pooling" step must be carried out when multiple polynucleotides are
simultaneously sequenced.

As before, samples of one of the two amplicons are distributed into four new
vessels containing primers from subgroup (3) (SEQ ID NO: 3) with an extension
region having the form "IAIN".

Reaction 9        AAGAGGAAGAGGAAGAGGIAIA ->
            ... CCTTCTCCTTCTCCTTCTCCCTANN ... NNNBBBB ... BB

Reaction 10       AAGAGGAAGAGGAAGAGGIAIC ->
            ... CCTTCTCCTTCTCCTTCTCCCTANN ... NNNBBBB ... BB

Reaction 11       AAGAGGAAGAGGAAGAGGIAII ->
            ... CCTTCTCCTTCTCCTTCTCCCTANN ... NNNBBBB ... BB

Reaction 12       AAGAGGAAGAGGAAGAGGIAIT ->
            ... CCTTCTCCTTCTCCTTCTCCCTANN ... NNNBBBB ... BB

Both Reactions 9 and 10 will produce amplicons; thus, the third base is identified as
an "I." For the next cycle, this then leads to the selection of primers from subgroup
(4) having an extension region with the form "AIAN," and the process is continued.
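
The three cycles worked through above can be traced with a short sketch. It is an illustration only: the pairing table `EXTENDS_ON` is read off the statements in this description (A pairs with T and with I; I pairs with C and with A), its entry for a template C is an assumption, and the function and variable names are hypothetical.

```python
# Primer 3'-terminal nucleotides taken to pair with each template base.
# Entries for T, A, and I follow the worked example in the text; the entry
# for C is an assumption, since the example stops before a template C.
EXTENDS_ON = {
    "T": {"A"},
    "A": {"T", "I"},
    "I": {"A", "C"},
    "C": {"I"},
}

# Terminal nucleotides of the four primers used in each cycle of the example.
PRIMER_TERMINALS = ("A", "C", "I", "T")


def reactions_yielding_amplicons(template):
    """For each template base, list the primer terminal nucleotides
    (one per reaction vessel) that would be extended and amplified."""
    return [
        [t for t in PRIMER_TERMINALS if t in EXTENDS_ON[base]]
        for base in template
    ]


if __name__ == "__main__":
    # First three positions of the example polynucleotide "TAIC".
    for position, hits in enumerate(reactions_yielding_amplicons("TAI"), start=1):
        print(position, hits)
```

Under these assumptions the first position gives a single amplicon (terminal A), while the second and third positions each give two, matching Reactions 7 and 8 and Reactions 9 and 10 above.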



Sequencing Tagged Polynucleotides Using Rolling Primers
A preferred embodiment for simultaneously sequencing a population of tagged
polynucleotides is diagrammed in Fig. 2b. Preferably, the population of tagged
polynucleotides is amplified from a vector as described above in the presence of
dATP, dCTP, dITP, and dTTP to give a population of double stranded DNAs (10)
containing T primer binding site (12), cleavage site (14), which as shown below is
optional, tag (16), cleavage site (18), polynucleotides (20), and rolling primer binding
site (22). The population of double stranded DNAs (10) is methylated so that when
tags are later excised only the double stranded DNAs that have been selectively
amplified, and therefore lack methylated bases, will be cleaved to yield tags for
detection.

In the initial population, rolling primer binding site (22) contains a known
complement to the extension region (24), for example, AGG as shown in the example
below. Samples of the initial population are preferably transferred (26) to four
separate vessels (28-34) where they are combined with the rolling primers of
subgroup (1), described above, having extension regions -AIIA, -AIIC, -AIIG, and
-AIIT. (The four rolling primers could be placed in a single vessel and allowed to
compete against one another for extension; however, errors are less likely if the
primers are used separately.) The rolling primers of subgroups (1)-(6) are used here to
exemplify the invention. Clearly, many alternative forms of the rolling primers could
be used. In subsequent cycles, as described more fully below, the transferring step
(26) becomes more complex because more than four vessels, i.e. up to 32 (=4 x 8) in
the embodiment exemplified here, are required for the extension reactions. After the
double stranded DNAs (10) are combined with the appropriate rolling primers the
following steps (36) are taken: the double stranded DNAs are denatured, e.g. by
heating; the temperature is lowered to permit the rolling primers to anneal to the
rolling primer binding sites; the primers are extended with a high fidelity DNA
polymerase, such as Sequenase, in the presence of dATP, dCTP, dITP, and dTTP;
preferably, any remaining single stranded DNA is digested, e.g. with a single stranded
nuclease, such as Mung bean nuclease, to reduce the likelihood of interference from
the left over single stranded DNA in the subsequent amplification; T primer is added;
and the double stranded extension products are amplified, preferably with 5-10 cycles
of PCR, to generate amplicon A (38), amplicon C (40), amplicon G (42), and
amplicon T (44), respectively.

After a sample is taken from each amplicon, tags are excised by way of
cleavage sites (14) and (18) and labeled (46), as described more fully below. The
labeled tags are then either applied separately to their tag complements on solid phase
support (48) or pooled and applied to the support, depending on the labeling system
employed, the complexity of the tag mixture, and like factors. Samples of the
amplicons are also taken for further processing (50-56) in accordance with the method
of the invention. Depending on the identity of the most recently determined
nucleotide and the identity of the current extension region, a sample either may be
separately aliquotted into vessels with rolling primers for the next cycle, or a sample
may be combined with one or more other samples and aliquotted into vessels with
rolling primers for the next cycle. Unlike the single polynucleotide case, when a
population of polynucleotides is sequenced every vessel will almost always contain an
amplicon at the conclusion of the amplification reaction. Thus, after extension,
digestion, and amplification, the amplicons in the vessels 28, 30, 32, and 34
correspond to polynucleotides having a T, G (or I), C, and A at their initial positions
(or more generally, at the nucleotide position adjacent to the rolling primer binding
site), respectively. With this information, and a knowledge of the sequence of the
extension region of the current amplicon, the rolling primers of the next cycle can be
selected. As in the single polynucleotide case, in each successive cycle a rolling
primer is selected that shifts, or advances, the rolling primer binding site one or more
nucleotides along the template in the direction of rolling primer extension.
Preferably, a single nucleotide shift takes place in each cycle. As described above, the
rolling primers selected for the extension step also serve to generate a mutation in the
template upon amplification. The mutation changes the interior-most nucleotide of
the extension region to one that is complementary to the template positioning segment
of the rolling primer of the current cycle. In the tables below, the pattern of primer
selection and amplicon pooling in cycles 2 through 4 of a sequencing operation is
illustrated for the above embodiment. In the first cycle, the original template is
distributed to four vessels for denaturation and extension.



Selection of Rolling Primers for 2nd Cycle

            Sequence adjacent        Extension region     Extension regions of
            to primer                of next rolling      merged or combined
Amplicon    binding site             primer               primers

A           IIA|A ...          ->    IAA                  -IAAA, -IAAC, -IAAG, -IAAT
            CCT|T ...

C           IIA|C ...          ->    IAA
            CCT|I ...

G           IIA|G ...          ->    IAI
            CCT|C ...

T           IIA|T ...          ->    IAI
            CCT|A ...




The nucleotide to the right of the line between nucleotides in the second column is the
terminal nucleotide of the rolling primer used to produce the amplicon. Generally, the
algorithm for determining the rolling primers of the next cycle is as follows: (i) drop
the nucleotide distal to the terminal nucleotide in the extension region of the current
rolling primer (the leftmost "I" of the "IIA" sequences in the second column), (ii)
determine which nucleotide, I or A, is complementary to the nucleotide paired with
the terminal nucleotide (i.e. for the above example: "A" for amplicon A; "A" for
amplicon C, since A will pair with I as well as T; "I" for amplicon G, since I will pair
with C; and "I" for amplicon T, since I will also pair with A), (iii) insert the
determined nucleotide, I or A, to the left of the terminal nucleotide. For this
embodiment, the general pattern of transitions between extension region sequences is
illustrated in Figure 2b. Longer extension regions lead to more complex patterns, but
the basic algorithm defining permissible transitions remains the same.
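
Steps (i) through (iii) can be sketched in a few lines. This is a minimal illustration under the three-nucleotide extension regions of the embodiment above; the function names and the `NEXT_FIXED` table are hypothetical labels for the rule stated in step (ii), not part of the described method.

```python
# Step (ii): nucleotide (I or A) added to the extension region, keyed by the
# terminal nucleotide of the rolling primer that produced the amplicon.
NEXT_FIXED = {"A": "A", "C": "A", "G": "I", "T": "I"}


def next_extension_region(current_region, terminal):
    """Steps (i) and (iii): drop the leftmost nucleotide of the current
    extension region and append the nucleotide chosen in step (ii)."""
    return current_region[1:] + NEXT_FIXED[terminal]


def next_cycle_primers(current_region, terminal, bases="ACGT"):
    """Extension regions of the four rolling primers for the next cycle."""
    region = next_extension_region(current_region, terminal)
    return [region + base for base in bases]


def pool_amplicons(amplicons):
    """Group (extension region, terminal nucleotide) amplicons that share the
    same next extension region, i.e. that can be merged before aliquotting."""
    pools = {}
    for region, terminal in amplicons:
        key = next_extension_region(region, terminal)
        pools.setdefault(key, []).append((region, terminal))
    return pools


if __name__ == "__main__":
    second_cycle = [("IIA", base) for base in "ACGT"]
    for region, members in pool_amplicons(second_cycle).items():
        print(region, members)
```

Applied to the four amplicons of the 2nd-cycle table, the sketch pools amplicons A and C under "IAA" and amplicons G and T under "IAI".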



Selection of Rolling Primers for 3rd Cycle

            Sequence adjacent        Extension region     Extension regions of
            to primer                of next rolling      merged or combined
Amplicon    binding site             primer               primers

A1          IAA|A              ->    AAA
            CTT|T

A2          IAI|A              ->    AIA
            CTC|T

C1          IAA|C              ->    AAA
            CTT|I

C2          IAI|C              ->    AIA
            CTC|I

G1          IAA|G              ->    AAI
            CTT|C

G2          IAI|G              ->    AII                  -AIIA, -AIIC, -AIIG, -AIIT
            CTC|C

T1          IAA|T              ->    AAI
            CTT|A

T2          IAI|T              ->    AII
            CTC|A


Selection of Rolling Primers for 4th Cycle



            Sequence adjacent        Extension region     Extension regions of
            to primer                of next rolling      merged or combined
Amplicon    binding site             primer               primers

A1          AAA|A              ->    AAA                  -AAAA, -AAAC, -AAAG, -AAAT
            TTT|T

A2          AIA|A              ->    IAA
            TCT|T

A3          AAI|A              ->    AIA
            TTC|T

A4          AII|A              ->    IIA
            TCC|T

C1          AAA|C              ->    AAA
            TTT|I

C2          AIA|C              ->    IAA
            TCT|I

C3          AAI|C              ->    AIA
            TTC|I

C4          AII|C              ->    IIA
            TCC|I

G1          AAA|G              ->    AAI                  -AAIA, -AAIC, -AAIG, -AAIT
            TTT|C

G2          AIA|G              ->    IAI
            TCT|C

G3          AAI|G              ->    AII
            TTC|C

G4          AII|G              ->    III
            TCC|C

T1          AAA|T              ->    AAI
            TTT|A

T2          AIA|T              ->    IAI
            TCT|A

T3          AAI|T              ->    AII
            TTC|A

T4          AII|T              ->    III
            TCC|A




Typically, by the eighth cycle thirty-two reactions are required, and continue to be
required, in each cycle until sequencing is halted.

Clearly, additional steps to those outlined above may be implemented, for
example, to separate the initial extension product from extraneous single stranded
DNA and/or the single stranded nuclease, if one is employed. Manipulation of
polynucleotides and other reagents, temperature control for PCRs, and the like, may
be carried out on commercially available laboratory robots, e.g. Biomek 1000
(Beckman Instruments, Fullerton, CA).

Rolling primers and T primers may be constructed to have a double stranded
segment capable of binding to an anchored single stranded oligonucleotide via triplex
formation for separation, e.g. as taught by Ji et al, Anal. Chem. 65: 1323-1328 (1993);
Cantor et al, U.S. patent 5,482,836; or the like. Thus, for example, magnetic beads
carrying such a single stranded oligonucleotide can be used to capture the amplicons
and transfer them to a separate vessel containing a nuclease to cleave the tag, e.g. at
cleavage site 18, of those double stranded DNAs that have been selectively amplified
(other DNAs remain unamplified and therefore hemi-methylated so no cleavage
occurs). Preferably, the T primer contains a 5' biotin which permits the released tag to
be captured and conveniently labeled. After capture, e.g. via avidinated magnetic
beads, the 3' strands of the double stranded segment are stripped back to the tag by the
use of T4 DNA polymerase, or like enzyme, in the presence of a deoxynucleoside
triphosphate (dNTP) corresponding to the nucleotide flanking the tag. Thus, provided
that the flanking nucleotides are not present elsewhere along the strand to the 3' ends,
the 3'->5' exonuclease activity of the polymerase will strip back the 3' strand to the
flanking nucleotides, at which point an exchange reaction will be initiated that
prevents further stripping past the flanking nucleotides. The 3' ends of the tag can
then be labeled in an extension reaction with labeled dNTPs. After labeling, the
non-biotinylated strand can be removed by denaturation and applied to the spatially
addressable array for detection.
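
The proviso about the flanking nucleotide can be made concrete with a toy model of the stripping step. The sketch below treats a strand as a plain string written 5'->3' and stops trimming at the first nucleotide, counted from the 3' end, whose dNTP is present; it ignores all enzyme kinetics, and the function name and example sequences are hypothetical.

```python
def strip_back(strand_5_to_3, supplied_nucleotide):
    """Trim a strand from its 3' end until its 3'-terminal nucleotide matches
    the single supplied dNTP, where exchange is assumed to halt stripping."""
    end = len(strand_5_to_3)
    while end > 0 and strand_5_to_3[end - 1] != supplied_nucleotide:
        end -= 1
    return strand_5_to_3[:end]


if __name__ == "__main__":
    # Flanking G occurs only once: stripping stops exactly at the flank.
    print(strip_back("ACTTCAGTTATCA", "G"))
    # A second G nearer the 3' end halts stripping early, which is why the
    # text requires the flanking nucleotide to be absent from that stretch.
    print(strip_back("ACTTCAGTTAGCA", "G"))
```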

After the labeled tags are hybridized to their tag complements and detected, 
the tags are removed by washing so that labeled tags from the next set of amplicons 
can be applied. 


Sequencing Tagged Polynucleotides with S and T Primers 
In a preferred embodiment of the invention, the identification of the terminal
nucleotides of polynucleotide inserts is accomplished by selectively amplifying
sequences that form perfectly matched duplexes with the 3' end of the S primers. As
illustrated in Fig. 2d, in each identification cycle multiple sets of S primers are used
with the same T primer to produce PCR amplicons. Segments of cloning vector 410
containing T primer binding site 412, cleavage site 414, tag 416, cleavage site 418,
polynucleotide 420, and S primer binding site 422 are amplified in separate PCRs for
each of the different S primers, which in Fig. 2d number 4k. The S primers of sets 1
through k form duplexes with the S primer binding site and 1 to k terminal
nucleotides of the polynucleotide, respectively. In Fig. 2d, "N" represents a mixture
of the four natural nucleotides, A, C, G, and T. Thus, the S primers of set 2 are each
mixtures of 4 primers: one having a 3' terminal A and a penultimate nucleotide of A,
C, G, or T; one having a 3' terminal C and a penultimate nucleotide of A, C, G, or T;
one having a 3' terminal G and a penultimate nucleotide of A, C, G, or T; and one
having a 3' terminal T and a penultimate nucleotide of A, C, G, or T. In sets 3 through
k, "B" represents a complexity-reducing analog, or a mixture of such analogs and
natural nucleotides. For example, B could consist of C and deoxyinosine. Such
analogs are well known and are described in Kong Thoo Lin et al, Nucleic Acids
Research, 20: 5149-5152; U.S. patent 5,002,867; Loakes et al, Nucleic Acids
Research, 22: 4039-4043 (1994); Nichols et al, Nature, 369: 492-493 (1994); and like
references. The penultimate N's in S primers 2 through k could also be complexity
reducing analogs, provided that the presence of such an analog had no effect on the
base pairing or extension of the terminal nucleotide. Indeed, it is not crucial that an S
primer form a perfectly matched duplex with its binding site and polynucleotide over
the primer's entire length. It is only required that the terminal nucleotide base pair
correctly. Preferably, the S and T primers have approximately equal melting and
annealing temperatures, and are in the range of 12 to 30 nucleotides in length.

After carrying out the PCRs, k sets of 4 amplicons are produced. The 4k
amplicons carry the following information: In set 1, amplicons made with S primers
having a terminal A indicate polynucleotides whose first nucleotide is T; amplicons
made with S primers having a terminal C indicate polynucleotides whose first
nucleotide is G; and so on. In set 2, amplicons made with S primers having a terminal
AN indicate polynucleotides whose second nucleotide is T; amplicons made with S
primers having a terminal CN indicate polynucleotides whose second nucleotide is G;
and so on. Likewise, for amplicons of sets 3 through k, the identities of the
nucleotides of the 3rd through kth positions are indicated, respectively. To extract
this information, the tags in the amplicons must be excised, labeled, rendered single
stranded, and hybridized to their tag complements on a spatially addressable array.
Tags 416 are excised by cleaving with restriction endonucleases directed to sites 414
and 418. Preferably, the T primer is constructed to have a means for isolating the
amplicons, such as a biotin moiety or a double stranded segment which can form a
triplex structure with an anchored single stranded oligonucleotide. Preferably, after
separation, the tags of each of the four amplicons of each set are separately labeled,
e.g. with a spectrally resolvable fluorescent dye. Thus, all the tags in amplicons made
from S primers terminating in A have the same label which can be distinguished from
the labels for C, G, and T. The same holds for tags in amplicons made from S primers
terminating in C, G, and T, regardless of the set number. Thus, when the tags of the
nth set are simultaneously applied to the spatially addressable sites, the identities of
the nucleotides at the nth position of all the polynucleotides are determined.
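
How the 4k amplicons translate into sequences can be sketched as a simple lookup. The sketch assumes an idealized readout in which, for each set n, every tag address on the array reports exactly one label, namely the identifying nucleotide of the S primer whose amplicon carried that tag; the data structure, tag names, and function name are hypothetical.

```python
# The identifying primer nucleotide pairs with the template nucleotide, so the
# template base is its Watson-Crick complement (primer A indicates T, etc.).
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def decode_polynucleotides(signals_by_set):
    """signals_by_set[n] maps each tag to the identifying S primer nucleotide
    observed for set n+1; returns the decoded bases for every tag, in order."""
    sequences = {}
    for set_signals in signals_by_set:
        for tag, primer_base in set_signals.items():
            sequences[tag] = sequences.get(tag, "") + COMPLEMENT[primer_base]
    return sequences


if __name__ == "__main__":
    readout = [
        {"tag1": "A", "tag2": "C"},  # set 1: first nucleotide
        {"tag1": "G", "tag2": "T"},  # set 2: second nucleotide
    ]
    print(decode_polynucleotides(readout))
```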
Tags can be labeled in a variety of ways, including the direct or indirect
attachment of radioactive moieties, fluorescent moieties, colorimetric moieties,
chemiluminescent markers, and the like. Many comprehensive reviews of
methodologies for labeling DNA and constructing DNA probes provide guidance
applicable to labeling tags of the present invention. Such reviews include Kricka,
editor, Nonisotopic DNA Probe Techniques (Academic Press, San Diego, 1992);
Haugland, Handbook of Fluorescent Probes and Research Chemicals (Molecular
Probes, Inc., Eugene, 1992); Keller and Manak, DNA Probes, 2nd Edition (Stockton
Press, New York, 1993); Eckstein, editor, Oligonucleotides and Analogues: A
Practical Approach (IRL Press, Oxford, 1991); Kessler, editor, Nonradioactive
Labeling and Detection of Biomolecules (Springer-Verlag, Berlin, 1992); and the
like.

Preferably, the tags are labeled with one or more fluorescent dyes, e.g. as
disclosed by Menchen et al, U.S. patent 5,188,934; and Bergot et al, International
application PCT/US90/05565.

Preferably, the S primers are constructed to have a double stranded segment
capable of binding to an anchored single stranded oligonucleotide for separation, e.g.
as taught by Ji et al, Anal. Chem. 65: 1323-1328 (1993). Thus, for example, magnetic
beads carrying such a single stranded oligonucleotide can be used to capture the
amplicons and transfer them to a separate vessel containing a nuclease to cleave the
tag, e.g. at cleavage site 418. Preferably, the T primer contains a 5' biotin which
permits the released tag to be captured and conveniently labeled. After capture, e.g. via
avidinated magnetic beads, the 3' strands of the double stranded segment are stripped
back to the tag by the use of T4 DNA polymerase, or like enzyme, in the presence of a
deoxynucleoside triphosphate (dNTP) corresponding to the nucleotide flanking the
tag. Thus, provided that the flanking nucleotides are not present elsewhere along the
strand to the 3' ends, the 3'->5' exonuclease activity of the polymerase will strip back
the 3' strand to the flanking nucleotides, at which point an exchange reaction will be
initiated that prevents further stripping past the flanking nucleotides. The 3' ends of
the tag can then be labeled in an extension reaction with labeled dNTPs. After
labeling, the non-biotinylated strand can be removed by denaturation and applied to
the spatially addressable array for detection.

After the labeled tags are hybridized to their tag complements and detected,
the tags are removed by washing so that labeled tags from the next set of amplicons
can be applied. As long as one keeps track of the amplicon from which each set of
tags arose, the ordering of hybridizations is not crucial. Preferably, hybridizations are
done in the same order as the order of the corresponding nucleotide in the target
sequence.

The extent to which the S primers can overlap the polynucleotide for
identifying successive nucleotides is limited by the degeneracy, or complexity, of the
primer mixture as the overlap increases. This difficulty is addressed by periodically
cleaving off the identified nucleotides from the polynucleotide, and then starting the
identification cycle over again on the shortened polynucleotide. Such cleavage is
effected by providing an adaptor with an S primer binding region having a recognition
site of a nuclease that has a cleavage site separate from its recognition site. The
recognition site of the nuclease is positioned so that it cleaves the polynucleotide a
predetermined number of nucleotides from the border of the S primer binding site.
Such a nuclease is referred to herein as a "stepping" nuclease. Preferably, a type IIs
restriction endonuclease is employed as a stepping nuclease in the invention. Prior
to cleavage the polynucleotide must be treated, e.g. by methylation, to prevent
fortuitous cleavage because of internal recognition sites of the stepping nuclease being
employed. After cleavage, the strand bearing the S primer binding site is removed,
e.g. via triplex capture on magnetic beads, and the adaptor containing a replacement S
primer binding site (and stepping nuclease recognition site) is ligated to the remaining
strand. The adaptor contains a degenerate protruding strand that is compatible with
the cleavage produced by the stepping nuclease. For example, if the type IIs nuclease
Bbv I is employed, the adaptor may have the following structure (SEQ ID NO: 7):

    5'-AAAGAAAGGAAGGGCAGCT
       TTTCTTTCCTTCCCGTCGANNNN

where N is defined as above and the underlined sequence (GCAGC in the upper
strand) is the Bbv I recognition site.

A stepping nuclease employed in the invention need not be a single protein, or
consist solely of a combination of proteins. A key feature of the stepping nuclease, or
of the combination of reagents employed as a stepping nuclease, is that its (their)
cleavage site be separate from its (their) recognition site. The distance between the
recognition site of a stepping nuclease and its cleavage site will be referred to herein as
its "reach." By convention, "reach" is defined by two integers which give the number
of nucleotides between the recognition site and the hydrolyzed phosphodiester bonds of
each strand. For example, the recognition and cleavage properties of Fok I are typically
represented as "GGATG(9/13)" because it recognizes and cuts a double stranded DNA
as follows (SEQ ID NO: 8):

    5'-... NNGGATGNNNNNNNNN         NNNNNNNNNN ...
    3'-... NNCCTACNNNNNNNNNNNNN         NNNNNN ...

where the bolded nucleotides are Fok I's recognition site and the N's are arbitrary
nucleotides and their complements.
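
The "reach" convention lends itself to a short worked example. The sketch below is a simplification that looks for the recognition site only on the strand supplied, in the orientation written, and reports the two cut positions in that strand's coordinates; the function name and the example sequence are hypothetical.

```python
def stepping_nuclease_cut(top_strand, site="GGATG", reach=(9, 13)):
    """Locate the first recognition site on the given strand (5'->3') and
    return (top_cut, bottom_cut, overhang), where each cut is the number of
    nucleotides to the left of the hydrolyzed bond and overhang is the single
    stranded stretch between the two cuts. Returns None if no site is found
    or the strand is too short to contain both cuts."""
    start = top_strand.find(site)
    if start < 0:
        return None
    site_end = start + len(site)
    top_cut = site_end + reach[0]
    bottom_cut = site_end + reach[1]
    if bottom_cut > len(top_strand):
        return None
    return top_cut, bottom_cut, top_strand[top_cut:bottom_cut]


if __name__ == "__main__":
    # Two arbitrary nucleotides, the Fok I site, then 9 + 4 + 6 nucleotides,
    # mirroring the layout of the diagram above.
    example = "AA" + "GGATG" + "ACGTACGTA" + "CCCC" + "TTTTTT"
    print(stepping_nuclease_cut(example))
```

With a reach of (9/13) the two cuts are staggered by four nucleotides, which is the protruding strand that a degenerate adaptor overhang would have to match.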

Preferably, stepping nucleases employed in the invention are natural protein
endonucleases (i) whose recognition site is separate from its cleavage site and (ii)
whose cleavage results in a protruding strand on the polynucleotide. Most preferably,
type IIs restriction endonucleases are employed as stepping nucleases in the invention,
e.g. as described in Szybalski et al, Gene, 100: 13-26 (1991); Roberts et al, Nucleic
Acids Research, 21: 3125-3137 (1993); and Livak and Brenner, U.S. patent 5,093,245.
Exemplary type IIs nucleases include Alw XI, Bsm AI, Bbv I, Bsm FI, Sts I, Hga I, Bsc
AI, Bbv II, Bce fI, Bce 85I, Bcc I, Bcg I, Bsa I, Bsg I, Bsp MI, Bst 71 I, Ear I, Eco 57I,
Esp 3I, Fau I, Fok I, Gsu I, Hph I, Mbo II, Mme I, Rle AI, Sap I, Sfa NI, Taq II, Tth
111II, Bco 5I, Bpu AI, Fin I, Bsr DI, and isoschizomers thereof. Preferred nucleases
include Fok I, Bbv I, Hga I, Ear I, and Sfa NI.

Preferably, prior to nuclease cleavage steps, usually at the start of an
identification cycle, the polynucleotide is treated to block the recognition sites and/or
cleavage sites of the nuclease being employed. This prevents undesired cleavage of the
polynucleotide because of the fortuitous occurrence of nuclease recognition sites at
interior locations in the polynucleotide. Blocking can be achieved in a variety of ways,
including methylation and treatment by sequence-specific aptamers, DNA binding
proteins, or oligonucleotides that form triplexes. Whenever natural protein
endonucleases are employed, recognition sites can be conveniently blocked by
methylating the polynucleotide with the cognate methylase of the nuclease being used.
That is, for most if not all type II bacterial restriction endonucleases, there exists a
so-called "cognate" methylase that methylates its recognition site. Many such methylases
are disclosed in Roberts et al (cited above) and Nelson et al, Nucleic Acids Research,
21: 3139-3154 (1993), and are commercially available from a variety of sources,
particularly New England Biolabs (Beverly, MA).

Oligonucleotide Tags and Tag Complements 
In one aspect, the oligonucleotide tags of the invention comprise a plurality of
"words" or subunits selected from minimally cross-hybridizing sets of subunits.
Subunits of such sets cannot form a duplex or triplex with the complement of another
subunit of the same set with fewer than two mismatched nucleotides. Thus, the
sequences of any two oligonucleotide tags of a repertoire that form duplexes will
never be "closer" than differing by two nucleotides. In particular embodiments,
sequences of any two oligonucleotide tags of a repertoire can be even "further" apart,
e.g. by designing a minimally cross-hybridizing set such that subunits cannot form a
duplex with the complement of another subunit of the same set with fewer than three
mismatched nucleotides, and so on. Usually, oligonucleotide tags of the invention
and their complements are oligomers of the natural nucleotides so that they may be
conveniently processed by enzymes, such as ligases, polymerases, nucleases, terminal
transferases, and the like.

Complements of oligonucleotide tags of the invention, referred to herein as
"tag complements," may comprise natural nucleotides or non-natural nucleotide
analogs. Preferably, tag complements are attached to solid phase supports.

Minimally cross-hybridizing sets of oligonucleotide tags and tag complements
may be synthesized either combinatorially or individually depending on the size of the
set desired and the degree to which cross-hybridization is sought to be minimized (or
stated another way, the degree to which specificity is sought to be enhanced). For
example, a minimally cross-hybridizing set may consist of a set of individually
synthesized 10-mer sequences that differ from each other by at least 4 nucleotides,
such set having a maximum size of 332 (when composed of 3 kinds of nucleotides
and counted using a computer program such as disclosed in Appendix Ic).
Alternatively, a minimally cross-hybridizing set of oligonucleotide tags may also be
assembled combinatorially from subunits which themselves are selected from a
minimally cross-hybridizing set. For example, a set of minimally cross-hybridizing
12-mers differing from one another by at least three nucleotides may be synthesized
by assembling 3 subunits selected from a set of minimally cross-hybridizing 4-mers
that each differ from one another by three nucleotides. Such an embodiment gives a
maximally sized set of 9^3, or 729, 12-mers. The number 9 is the number of
oligonucleotides listed by the computer program of Appendix Ia, which assumes, as
with the 10-mers, that only 3 of the 4 different types of nucleotides are used. The set
is described as "maximal" because the computer programs of Appendices Ia-c provide
the largest set for a given input (e.g. length, composition, difference in number of
nucleotides between members). Additional minimally cross-hybridizing sets may be
formed from subsets of such calculated sets.

Oligonucleotide tags may be single stranded and be designed for specific
hybridization to single stranded tag complements by duplex formation or for specific
hybridization to double stranded tag complements by triplex formation.
Oligonucleotide tags may also be double stranded and be designed for specific
hybridization to single stranded tag complements by triplex formation. Preferably, as
used for shuttling sequence information, tags and tag complements are single
stranded.

When synthesized combinatorially, an oligonucleotide tag of the invention
preferably consists of a plurality of subunits, each subunit consisting of an
oligonucleotide of 3 to 9 nucleotides in length wherein each subunit is selected from
the same minimally cross-hybridizing set. In such embodiments, the number of
oligonucleotide tags available depends on the number of subunits per tag and on the
length of the subunits. The number is generally much less than the number of all
possible sequences the length of the tag, which for a tag n nucleotides long would be
4^n.

The nucleotide sequences of oligonucleotides of a minimally cross-hybridizing
set are conveniently enumerated by simple computer programs following the general
algorithm illustrated in Fig. 1 and as exemplified by programs whose source codes
are listed in Appendices Ia and Ib. Program minhx of Appendix Ia computes all
minimally cross-hybridizing sets having 4-mer subunits composed of three kinds of
nucleotides. Program tagN of Appendix Ib enumerates longer oligonucleotides of a
minimally cross-hybridizing set. Similar algorithms and computer programs are
readily written for listing oligonucleotides of minimally cross-hybridizing sets for any
embodiment of the invention. Table I below provides guidance as to the size of sets
of minimally cross-hybridizing oligonucleotides for the indicated lengths and number
of nucleotide differences. The above computer programs were used to generate the
numbers.




                                     Table I

                   Difference between
                   Oligonucleotides     Maximal Size      Size of          Size of
Oligonucleotide    of Minimally         of Minimally      Repertoire       Repertoire
Word               Cross-Hybridizing    Cross-Hybridizing with Four        with Five
Length             Set                  Set               Words            Words

 4                  3                     9               6561             5.90 x 10^4
 6                  3                    27               5.3 x 10^5       1.43 x 10^7
 7                  4                    27               5.3 x 10^5       1.43 x 10^7
 7                  5                     8               4096             3.28 x 10^4
 8                  3                   190               1.30 x 10^9      2.48 x 10^11
 8                  4                    62               1.48 x 10^7      9.16 x 10^8
 8                  5                    18               1.05 x 10^5      1.89 x 10^6
 9                  5                    39               2.31 x 10^6      9.02 x 10^7
10                  5                   332               1.21 x 10^10
10                  6                    28               6.15 x 10^5      1.72 x 10^7
11                  5                   187
18                  6                  ~25000
18                 12                    24
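
The repertoire columns of Table I follow from raising the set size to the number of words per tag, since each word position is filled independently from the same set. The check below is a minimal sketch of that arithmetic; the function name is hypothetical.

```python
def repertoire_size(set_size, words_per_tag):
    """Number of distinct tags built from words_per_tag subunits, each drawn
    independently from a minimally cross-hybridizing set of set_size words."""
    return set_size ** words_per_tag


if __name__ == "__main__":
    # (set size, words per tag) pairs taken from Table I and the 12-mer example.
    for set_size, words in [(9, 3), (9, 4), (9, 5), (8, 4), (27, 5)]:
        size = repertoire_size(set_size, words)
        print(f"{set_size}^{words} = {size} ({size:.2e})")
```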







For some embodiments of the invention, where extremely large repertoires of
tags are not required, oligonucleotide tags of a minimally cross-hybridizing set may
be separately synthesized. Sets containing several hundred to several thousands, or
even several tens of thousands, of oligonucleotides may be synthesized directly by a
variety of parallel synthesis approaches, e.g. as disclosed in Frank et al, U.S. patent
4,689,405; Frank et al, Nucleic Acids Research, 11: 4365-4377 (1983); Matson et al,
Anal. Biochem., 224: 110-116 (1995); Fodor et al, International application
PCT/US93/04145; Pease et al, Proc. Natl. Acad. Sci., 91: 5022-5026 (1994);
Southern et al, J. Biotechnology, 35: 217-227 (1994); Brennan, International
application PCT/US94/05896; Lashkari et al, Proc. Natl. Acad. Sci., 92: 7912-7915
(1995); or the like.

Preferably, oligonucleotide tags of the invention are synthesized
combinatorially out of subunits between three and six nucleotides in length and
selected from the same minimally cross-hybridizing set. For oligonucleotides in this
range, the members of such sets may be enumerated by computer programs based on
the algorithm of Fig. 1.

The algorithm of Fig. 1 is implemented by first defining the characteristics of
the subunits of the minimally cross-hybridizing set, i.e. length, number of base
differences between members, and composition, e.g. do they consist of two, three, or
four kinds of bases. A table M_n, n=1, is generated (100) that consists of all possible
sequences of a given length and composition. An initial subunit S_1 is selected and
compared (120) with successive subunits S_i for i=n+1 to the end of the table.
Whenever a successive subunit has the required number of mismatches to be a
member of the minimally cross-hybridizing set, it is saved in a new table M_n+1 (125),
that also contains subunits previously selected in prior passes through step 120. For
example, in the first set of comparisons, M_2 will contain S_1; in the second set of
comparisons, M_3 will contain S_1 and S_2; in the third set of comparisons, M_4 will
contain S_1, S_2, and S_3; and so on. Similarly, comparisons in table M_j will be
between S_j and all successive subunits in M_j. Note that each successive table M_n+1
is smaller than its predecessors as subunits are eliminated in successive passes
through step 130. After every subunit of table M_n has been compared (140) the old
table is replaced by the new table M_n+1, and the next round of comparisons is
begun. The process stops (160) when a table M_n is reached that contains no
successive subunits to compare to the selected subunit S_i, i.e. M_n = M_n+1.
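The table-filtering procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the program of the appendices: the function names `hamming` and `minimally_cross_hybridizing_set` are hypothetical, candidate subunits are enumerated in lexicographic order, and the only criterion applied is the number of mismatched positions (duplex stability and composition constraints are ignored).

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Count positions at which two equal-length subunits differ."""
    return sum(x != y for x, y in zip(a, b))


def minimally_cross_hybridizing_set(alphabet: str, length: int,
                                    min_diff: int) -> list[str]:
    """Greedy selection in the spirit of Fig. 1 (illustrative sketch).

    Start from the table of all sequences (M_1). In pass n, take the n-th
    entry as the selected subunit, keep only those later entries that
    differ from it in at least `min_diff` positions, and use the result as
    the next table (M_n+1). Stop when no later entries remain.
    """
    table = ["".join(p) for p in product(alphabet, repeat=length)]
    n = 0
    while n < len(table):
        selected = table[n]
        survivors = [s for s in table[n + 1:]
                     if hamming(selected, s) >= min_diff]
        # Previously selected subunits are carried forward unchanged.
        table = table[: n + 1] + survivors
        n += 1
    return table


if __name__ == "__main__":
    words = minimally_cross_hybridizing_set("AGT", 4, 3)
    print(len(words), words)
```

Because the selection is greedy, the set it returns depends on the enumeration order and is not guaranteed to be the largest possible set for the given parameters.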

Preferably, minimally cross-hybridizing sets comprise subunits that make
approximately equivalent contributions to duplex stability as every other subunit in
the set. In this way, the stability of perfectly matched duplexes between every subunit
and its complement is approximately equal. Guidance for selecting such sets is
provided by published techniques for selecting optimal PCR primers and calculating
duplex stabilities, e.g. Rychlik et al, Nucleic Acids Research, 17: 8543-8551 (1989)
and 18: 6409-6412 (1990); Breslauer et al, Proc. Natl. Acad. Sci., 83: 3746-3750
(1986); Wetmur, Crit. Rev. Biochem. Mol. Biol., 26: 227-259 (1991); and the like.
For shorter tags, e.g. about 30 nucleotides or less, the algorithm described by Rychlik
and Wetmur is preferred, and for longer tags, e.g. about 30-35 nucleotides or greater,
an algorithm disclosed by Suggs et al, pages 683-693 in Brown, editor, ICN-UCLA
Symp. Dev. Biol., Vol. 23 (Academic Press, New York, 1981) may be conveniently
employed. Clearly, there are many approaches available to one skilled in the art for
designing sets of minimally cross-hybridizing subunits within the scope of the
invention. For example, to minimize the effects of different base-stacking energies of
terminal nucleotides when subunits are assembled, subunits may be provided that
have the same terminal nucleotides. In this way, when subunits are linked, the sum of
the base-stacking energies of all the adjoining terminal nucleotides will be the same,
thereby reducing or eliminating variability in tag melting temperatures.

A "word" of terminal nucleotides, shown as the unnumbered W below, may also be
added to each end of a tag so that a perfect match is always formed between it and a
similar terminal "word" on any other tag complement. Such an augmented tag would
have the form:

    W     W1     W2     ...     Wk     W
    W'    W1'    W2'    ...     Wk'    W'

where the primed W's indicate complements. With ends of tags always forming
perfectly matched duplexes, all mismatched words will be internal mismatches,
thereby reducing the stability of tag-complement duplexes that otherwise would have
mismatched words at their ends. It is well known that duplexes with internal
mismatches are significantly less stable than duplexes with the same mismatch at a
terminus.



A preferred embodiment of minimally cross-hybridizing sets are those whose
subunits are made up of three of the four natural nucleotides. As will be discussed
more fully below, the absence of one type of nucleotide in the oligonucleotide tags
permits polynucleotides to be loaded onto solid phase supports by use of the 5'→3'
exonuclease activity of a DNA polymerase. The following is an exemplary minimally
cross-hybridizing set of subunits each comprising four nucleotides selected from the
group consisting of A, G, and T:



Table II

Word:       w1     w2     w3     w4
Sequence:   GATT   TGAT   TAGA   TTTG

Word:       w5     w6     w7     w8
Sequence:   GTAA   AGTA   ATGT   AAAG



In this set, each member would form a duplex having three mismatched bases with
the complement of every other member.
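The three-mismatch property of the Table II set can be checked directly. The sketch below assumes that the number of mismatched bases in a duplex between one word and the complement of another equals the number of positions at which the two words themselves differ; the names `TABLE_II` and `mismatches` are illustrative.

```python
from itertools import combinations

# Words w1 through w8 of Table II
TABLE_II = ["GATT", "TGAT", "TAGA", "TTTG", "GTAA", "AGTA", "ATGT", "AAAG"]


def mismatches(a: str, b: str) -> int:
    """Positions at which word `a` would mispair with the complement of `b`."""
    return sum(x != y for x, y in zip(a, b))


# Collect the distinct mismatch counts over all 28 unordered pairs.
distances = {mismatches(a, b) for a, b in combinations(TABLE_II, 2)}
print(distances)  # {3}

# The set uses only three of the four natural nucleotides.
print(sorted(set("".join(TABLE_II))))  # ['A', 'G', 'T']
```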

Further exemplary minimally cross-hybridizing sets are listed below in Table 
III. Clearly, additional sets can be generated by substituting different groups of 
nucleotides, or by using subsets of known minimally cross-hybridizing sets. 



Table III

Exemplary Minimally Cross-Hybridizing Sets of 4-mer Subunits

Set 1     Set 2     Set 3     Set 4     Set 5     Set 6

CATT      ACCC      AAAC      AAAG      AACA      AACG
CTAA      AGGG      ACCA      ACCA      ACAC      ACAA
TCAT      CACG      AGGG      AGGC      AGGG      AGGC
ACTA      CCGA      CACG      CACC      CAAG      CAAC
TACA      CGAC      CCGC      CCGG      CCGC      CCGG
TTTC      GAGC      CGAA      CGAA      CGCA      CGCA
ATCT      GCAG      GAGA      GAGA      GAGA      GAGA
AAAC      GGCA      GCAG      GCAC      GCCG      GCCC
          AAAA      GGCC      GGCG      GGAC      GGAG

Set 7     Set 8     Set 9     Set 10    Set 11    Set 12

AAGA      AAGC      AAGG      ACAG      ACCG      ACGA
ACAC      ACAA      ACAA      AACA      AAAA      AAAC
AGCG      AGCG      AGCC      AGGC      AGGC      AGCG
CAAG      CAAG      CAAC      CAAC      CACC      CACA
CCCA      CCCC      CCCG      CCGA      CCGA      CCAG
CGGC      CGGA      CGGA      CGCG      CGAG      CGGC
GACC      GACA      GACA      GAGG      GAGG      GAGG
GCGG      GCGG      GCGC      GCCC      GCAC      GCCC
GGAA      GGAC      GGAG      GGAA      GGCA      GGAA



The oligonucleotide tags of the invention and their complements are
conveniently synthesized on an automated DNA synthesizer, e.g. an Applied
Biosystems, Inc. (Foster City, California) model 392 or 394 DNA/RNA Synthesizer,
using standard chemistries, such as phosphoramidite chemistry, e.g. disclosed in the
following references: Beaucage and Iyer, Tetrahedron, 48: 2223-2311 (1992); Molko
et al, U.S. patent 4,980,460; Koster et al, U.S. patent 4,725,677; Caruthers et al, U.S.
patents 4,415,732; 4,458,066; and 4,973,679; and the like. Alternative chemistries,
e.g. resulting in non-natural backbone groups, such as phosphorothioate,
phosphoramidate, and the like, may also be employed provided that the resulting
oligonucleotides are capable of specific hybridization. In some embodiments, tags
may comprise naturally occurring nucleotides that permit processing or manipulation
by enzymes, while the corresponding tag complements may comprise non-natural
nucleotide analogs, such as peptide nucleic acids, or like compounds, that promote the
formation of more stable duplexes during sorting.

When microparticles are used as supports, repertoires of oligonucleotide tags
and tag complements may be generated by subunit-wise synthesis via "split and mix"
techniques, e.g. as disclosed in Shortle et al, International patent application
PCT/US93/03418 or Lyttle et al, Biotechniques, 19: 274-280 (1995). Briefly, the
basic unit of the synthesis is a subunit of the oligonucleotide tag. Preferably,
phosphoramidite chemistry is used and 3' phosphoramidite oligonucleotides are
prepared for each subunit in a minimally cross-hybridizing set, e.g. for the set first
listed above, there would be eight 4-mer 3'-phosphoramidites. Synthesis proceeds as
disclosed by Shortle et al or in direct analogy with the techniques employed to
generate diverse oligonucleotide libraries using nucleosidic monomers, e.g. as
disclosed in Telenius et al, Genomics, 13: 718-725 (1992); Welsh et al, Nucleic Acids
Research, 19: 5275-5279 (1991); Grothues et al, Nucleic Acids Research, 21: 1321-
1322 (1993); Hartley, European patent application 90304496.4; Lam et al, Nature,
354: 82-84 (1991); Zuckerman et al, Int. J. Pept. Protein Research, 40: 498-507
(1992); and the like. Generally, these techniques simply call for the application of
mixtures of the activated monomers to the growing oligonucleotide during the
coupling steps. Preferably, oligonucleotide tags and tag complements are synthesized
on a DNA synthesizer having a number of synthesis chambers which is greater than or
equal to the number of different kinds of words used in the construction of the tags.
That is, preferably there is a synthesis chamber corresponding to each type of word.
In this embodiment, words are added nucleotide-by-nucleotide, such that if a word
consists of five nucleotides there are five monomer couplings in each synthesis
chamber. After a word is completely synthesized, the synthesis supports are removed
from the chambers, mixed, and redistributed back to the chambers for the next cycle
of word addition. This latter embodiment takes advantage of the high coupling yields
of monomer addition, e.g. in phosphoramidite chemistries.
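The mix-and-redistribute cycle described above can be pictured with a small simulation. This is only a sketch under simplifying assumptions: each support is modeled as carrying a single growing tag, redistribution is treated as a uniformly random assignment to one chamber per word, and words are appended to a string without regard to the chemical direction of synthesis. The names `split_and_mix` and `WORDS` are illustrative.

```python
import random

# Eight 4-mer words (the set of Table II), one synthesis chamber per word.
WORDS = ["GATT", "TGAT", "TAGA", "TTTG", "GTAA", "AGTA", "ATGT", "AAAG"]


def split_and_mix(words, cycles, supports, seed=0):
    """Simulate split-and-mix synthesis (illustrative sketch).

    In every cycle all supports are pooled, redistributed at random among
    the chambers, and each support receives the word of its chamber.
    Returns the tag carried by each support after the last cycle.
    """
    rng = random.Random(seed)
    tags = [""] * supports
    for _ in range(cycles):
        # Random redistribution: pick a chamber for every support.
        chambers = [rng.randrange(len(words)) for _ in range(supports)]
        tags = [tag + words[c] for tag, c in zip(tags, chambers)]
    return tags


if __name__ == "__main__":
    tags = split_and_mix(WORDS, cycles=3, supports=5000)
    # No more than len(WORDS) ** cycles distinct tags can ever arise.
    print(len(set(tags)), "distinct tags out of a possible", len(WORDS) ** 3)
```

Each support ends up with exactly one tag sequence, while the pool as a whole approaches the full repertoire once the number of supports is large compared with the number of possible tags.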

Double stranded forms of tags may be made by separately synthesizing the
complementary strands followed by mixing under conditions that permit duplex
formation. Alternatively, double stranded tags may be formed by first synthesizing a
single stranded repertoire linked to a known oligonucleotide sequence that serves as a
primer binding site. The second strand is then synthesized by combining the single
stranded repertoire with a primer and extending with a polymerase. This latter
approach is described in Oliphant et al, Gene, 44: 177-183 (1986). Such duplex tags
may then be inserted into cloning vectors along with polynucleotides for sorting and
manipulation of the polynucleotide in accordance with the invention.

When tag complements are employed that are made up of nucleotides that
have enhanced binding characteristics, such as PNAs or oligonucleotide N3'→P5'
phosphoramidates, sorting can be implemented through the formation of D-loops
between tags comprising natural nucleotides and their PNA or phosphoramidate
complements, as an alternative to the "stripping" reaction employing the 3'→5'
exonuclease activity of a DNA polymerase to render a tag single stranded.

Oligonucleotide tags of the invention may range in length from 12 to 60
nucleotides or basepairs. Preferably, oligonucleotide tags range in length from 18 to
40 nucleotides or basepairs. More preferably, oligonucleotide tags range in length
from 25 to 40 nucleotides or basepairs. In terms of preferred and more preferred
numbers of subunits, these ranges may be expressed as follows:

Table IV

Numbers of Subunits in Tags in Preferred Embodiments

Monomers
in Subunit      Nucleotides in Oligonucleotide Tag

                (12-60)            (18-40)            (25-40)

3               4-20 subunits      6-13 subunits      8-13 subunits
4               3-15 subunits      4-10 subunits      6-10 subunits
5               2-12 subunits      3-8 subunits       5-8 subunits
6               2-10 subunits      3-6 subunits       4-6 subunits
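The entries of Table IV are consistent with dividing each end of a nucleotide range by the subunit length and discarding the remainder. The sketch below reproduces the table on that assumption; the rounding rule is inferred from the tabulated values rather than stated in the text, and the name `subunit_range` is illustrative.

```python
def subunit_range(min_nt: int, max_nt: int, subunit_len: int) -> tuple[int, int]:
    """Whole numbers of subunits corresponding to a tag length range,
    using integer (floor) division at both ends (inferred rule)."""
    return min_nt // subunit_len, max_nt // subunit_len


ranges = [(12, 60), (18, 40), (25, 40)]
for subunit_len in (3, 4, 5, 6):
    cells = [subunit_range(lo, hi, subunit_len) for lo, hi in ranges]
    print(subunit_len, "  ".join(f"{a}-{b} subunits" for a, b in cells))
```

Note that rounding the lower bound down can give a tag slightly shorter than the nominal minimum, e.g. eight 3-mer subunits span 24 nucleotides against a stated lower limit of 25.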



Most preferably, oligonucleotide tags are single stranded and specific hybridization
occurs via Watson-Crick pairing with a tag complement.

Preferably, repertoires of single stranded oligonucleotide tags of the invention
contain at least 100 members; more preferably, repertoires of such tags contain at
least 1000 members; and most preferably, repertoires of such tags contain at least
10,000 members.

Triplex Tags

In embodiments where specific hybridization occurs via triplex formation,
coding of tag sequences follows the same principles as for duplex-forming tags;
however, there are further constraints on the selection of subunit sequences.
Generally, third strand association via Hoogsteen type of binding is most stable along
homopyrimidine-homopurine tracks in a double stranded target. Usually, base triplets
form in T-A*T or C-G*C motifs (where "-" indicates Watson-Crick pairing and "*"
indicates Hoogsteen type of binding); however, other motifs are also possible. For
example, Hoogsteen base pairing permits parallel and antiparallel orientations
between the third strand (the Hoogsteen strand) and the purine-rich strand of the
duplex to which the third strand binds, depending on conditions and the composition
of the strands. There is extensive guidance in the literature for selecting appropriate
sequences, orientation, conditions, nucleoside type (e.g. whether ribose or
deoxyribose nucleosides are employed), base modifications (e.g. methylated cytosine,
and the like) in order to maximize, or otherwise regulate, triplex stability as desired in
particular embodiments, e.g. Roberts et al, Proc. Natl. Acad. Sci., 88: 9397-9401
(1991); Roberts et al, Science, 258: 1463-1466 (1992); Roberts et al, Proc. Natl.
Acad. Sci., 93: 4320-4325 (1996); Distefano et al, Proc. Natl. Acad. Sci., 90: 1179-
1183 (1993); Mergny et al, Biochemistry, 30: 9791-9798 (1991); Cheng et al, J. Am.
Chem. Soc., 114: 4465-4474 (1992); Beal and Dervan, Nucleic Acids Research, 20:
2773-2776 (1992); Beal and Dervan, J. Am. Chem. Soc., 114: 4976-4982 (1992);
Giovannangeli et al, Proc. Natl. Acad. Sci., 89: 8631-8635 (1992); Moser and Dervan,
Science, 238: 645-650 (1987); McShan et al, J. Biol. Chem., 267: 5712-5721 (1992);
Yoon et al, Proc. Natl. Acad. Sci., 89: 3840-3844 (1992); Blume et al, Nucleic Acids
Research, 20: 1777-1784 (1992); Thuong and Helene, Angew. Chem. Int. Ed. Engl.
32: 666-690 (1993); Escude et al, Proc. Natl. Acad. Sci., 93: 4365-4369 (1996); and
the like. Conditions for annealing single-stranded or duplex tags to their single-
stranded or duplex complements are well known, e.g. Ji et al, Anal. Chem. 65: 1323-
1328 (1993); Cantor et al, U.S. patent 5,482,836; and the like. Use of triplex tags has
the advantage of not requiring a "stripping" reaction with polymerase to expose the
tag for annealing to its complement.
Preferably, oligonucleotide tags of the invention employing triplex
hybridization are double stranded DNA and the corresponding tag complements are
single stranded. More preferably, 5-methylcytosine is used in place of cytosine in the
tag complements in order to broaden the range of pH stability of the triplex formed
between a tag and its complement. Preferred conditions for forming triplexes are
fully disclosed in the above references. Briefly, hybridization takes place in
concentrated salt solution, e.g. 1.0 M NaCl, 1.0 M potassium acetate, or the like, at
pH below 5.5 (or 6.5 if 5-methylcytosine is employed). Hybridization temperature
depends on the length and composition of the tag; however, for an 18-20-mer tag or
longer, hybridization at room temperature is adequate. Washes may be conducted
with less concentrated salt solutions, e.g. 10 mM sodium acetate, 100 mM MgCl2, pH
5.8, at room temperature. Tags may be eluted from their tag complements by
incubation in a similar salt solution at pH 9.0.

Minimally cross-hybridizing sets of oligonucleotide tags that form triplexes
may be generated by the computer program of Appendix Ic, or similar programs. An
exemplary set of double stranded 8-mer words is listed below in capital letters with
the corresponding complements in small letters. Each such word differs from each of
the other words in the set by at least three base pairs.

Table V

Exemplary Minimally Cross-Hybridizing
Set of Double Stranded 8-mer Tags

5'-AAGGAGAG    5'-AAAGGGGA    5'-AGAGAAGA    5'-AGGGGGGG
3'-TTCCTCTC    3'-TTTCCCCT    3'-TCTCTTCT    3'-TCCCCCCC
3'-ttcctctc    3'-tttcccct    3'-tctcttct    3'-tccccccc

5'-AAAAAAAA    5'-AAGAGAGA    5'-AGGAAAAG    5'-GAAAGGAG
3'-TTTTTTTT    3'-TTCTCTCT    3'-TCCTTTTC    3'-CTTTCCTC
3'-tttttttt    3'-ttctctct    3'-tccttttc    3'-ctttcctc

5'-AAAAAGGG    5'-AGAAGAGG    5'-AGGAAGGA    5'-GAAGAAGG
3'-TTTTTCCC    3'-TCTTCTCC    3'-TCCTTCCT    3'-CTTCTTCC
3'-tttttccc    3'-tcttctcc    3'-tccttcct    3'-cttcttcc

5'-AAAGGAAG    5'-AGAAGGAA    5'-AGGGGAAA    5'-GAAGAGAA
3'-TTTCCTTC    3'-TCTTCCTT    3'-TCCCCTTT    3'-CTTCTCTT
3'-tttccttc    3'-tcttcctt    3'-tccccttt    3'-cttctctt



Table VI

Repertoire Size of Various Double Stranded Tags
That Form Triplexes with Their Tag Complements

Oligonucleotide   Nucleotide Difference      Maximal Size        Size of         Size of
Word Length       between Oligonucleotides   of Minimally        Repertoire      Repertoire
                  of Minimally Cross-        Cross-Hybridizing   with Four       with Five
                  Hybridizing Set            Set                 Words           Words

 4                 2                              8              4096            3.2 x 10^4
 6                 3                              8              4096            3.2 x 10^4
 8                 3                             16              6.5 x 10^4      1.05 x 10^6
10                 5                              8              4096
15                 5                             92
20                 6                            765
20                 8                             92
20                10                             22



Preferably, repertoires of double stranded oligonucleotide tags of the invention
contain at least 10 members; more preferably, repertoires of such tags contain at least
100 members. Preferably, words are between 4 and 8 nucleotides in length for
combinatorially synthesized double stranded oligonucleotide tags, and oligonucleotide
tags are between 12 and 60 base pairs in length. More preferably, such tags are
between 18 and 40 base pairs in length.

Solid Phase Supports

Solid phase supports for use with the invention may have a wide variety of
forms, including microparticles, beads, membranes, slides, plates, micromachined
chips, and the like. Likewise, solid phase supports of the invention may comprise a
wide variety of compositions, including glass, plastic, silicon, alkanethiolate-
derivatized gold, cellulose, low cross-linked and high cross-linked polystyrene, silica
gel, polyamide, and the like. Preferably, either a population of discrete particles is
employed such that each has a uniform coating, or population, of complementary
sequences of the same tag (and no other), or a single or a few supports are employed
with spatially discrete regions each containing a uniform coating, or population, of
complementary sequences to the same tag (and no other). In the latter embodiment,
the area of the regions may vary according to particular applications; usually, the
regions range in area from several µm^2, e.g. 3-5, to several hundred µm^2, e.g. 100-
500. Preferably, such regions are spatially discrete so that signals generated by
events, e.g. fluorescent emissions, at adjacent regions can be resolved by the detection
system being employed. In some applications, it may be desirable to have regions
with uniform coatings of more than one tag complement, e.g. for simultaneous
sequence analysis, or for bringing separately tagged molecules into close proximity.

Tag complements may be used with the solid phase support that they are
synthesized on, or they may be separately synthesized and attached to a solid phase
support for use, e.g. as disclosed by Lund et al, Nucleic Acids Research, 16: 10861-
10880 (1988); Albretsen et al, Anal. Biochem., 189: 40-50 (1990); Wolf et al, Nucleic
Acids Research, 15: 2911-2926 (1987); or Ghosh et al, Nucleic Acids Research, 15:
5353-5372 (1987). Preferably, tag complements are synthesized on and used with the
same solid phase support, which may comprise a variety of forms and include a
variety of linking moieties. Such supports may comprise microparticles or arrays, or
matrices, of regions where uniform populations of tag complements are synthesized.
A wide variety of microparticle supports may be used with the invention, including
microparticles made of controlled pore glass (CPG), highly cross-linked polystyrene,
acrylic copolymers, cellulose, nylon, dextran, latex, polyacrolein, and the like,
disclosed in the following exemplary references: Meth. Enzymol., Section A, pages
11-147, vol. 44 (Academic Press, New York, 1976); U.S. patents 4,678,814;
4,413,070; and 4,046,720; and Pon, Chapter 19, in Agrawal, editor, Methods in
Molecular Biology, Vol. 20 (Humana Press, Totowa, NJ, 1993). Microparticle
supports further include commercially available nucleoside-derivatized CPG and
polystyrene beads (e.g. available from Applied Biosystems, Foster City, CA);
derivatized magnetic beads; polystyrene grafted with polyethylene glycol (e.g.,
TentaGel™, Rapp Polymere, Tubingen, Germany); and the like. Selection of the
support characteristics, such as material, porosity, size, shape, and the like, and the
type of linking moiety employed depends on the conditions under which the tags are
used. For example, in applications involving successive processing with enzymes,
supports and linkers that minimize steric hindrance of the enzymes and that facilitate
access to substrate are preferred. Other important factors to be considered in selecting
the most appropriate microparticle support include size uniformity, efficiency as a
synthesis support, degree to which surface area is known, and optical properties, e.g. as
explained more fully below, clear smooth beads provide instrumentational advantages
when handling large numbers of beads on a surface.

Exemplary linking moieties for attaching and/or synthesizing tags on
microparticle surfaces are disclosed in Pon et al, Biotechniques, 6: 768-775 (1988);
Webb, U.S. patent 4,659,774; Barany et al, International patent application
PCT/US91/06103; Brown et al, J. Chem. Soc. Commun., 1989: 891-893; Damha et
al, Nucleic Acids Research, 18: 3813-3821 (1990); Beattie et al, Clinical Chemistry,
39: 719-722 (1993); Maskos and Southern, Nucleic Acids Research, 20: 1679-1684
(1992); and the like.

As mentioned above, tag complements may also be synthesized on a single
(or a few) solid phase support to form an array of regions uniformly coated with tag
complements. That is, within each region in such an array the same tag complement
is synthesized. Techniques for synthesizing such arrays are disclosed in McGall et al,
International application PCT/US93/03767; Pease et al, Proc. Natl. Acad. Sci., 91:
5022-5026 (1994); Southern and Maskos, International application
PCT/GB89/01114; Maskos and Southern (cited above); Southern et al, Genomics, 13:
1008-1017 (1992); and Maskos and Southern, Nucleic Acids Research, 21: 4663-
4669 (1993).

Preferably, the invention is implemented with microparticles or beads
uniformly coated with complements of the same tag sequence. Microparticle supports
and methods of covalently or noncovalently linking oligonucleotides to their surfaces
are well known, as exemplified by the following references: Beaucage and Iyer (cited
above); Gait, editor, Oligonucleotide Synthesis: A Practical Approach (IRL Press,
Oxford, 1984); and the references cited above. Generally, the size and shape of a
microparticle is not critical; however, microparticles in the size range of a few, e.g. 1-
2, to several hundred, e.g. 200-1000 µm diameter are preferable, as they facilitate the
construction and manipulation of large repertoires of oligonucleotide tags with
minimal reagent and sample usage.

In some preferred applications, commercially available controlled-pore glass
(CPG) or polystyrene supports are employed as solid phase supports in the invention.
Such supports come available with base-labile linkers and initial nucleosides attached,
e.g. Applied Biosystems (Foster City, CA). Preferably, microparticles having pore
size between 500 and 1000 angstroms are employed.

In other preferred applications, non-porous microparticles are employed for
their optical properties, which may be advantageously used when tracking large
numbers of microparticles on planar supports, such as a microscope slide.
Particularly preferred non-porous microparticles are the glycidal methacrylate (GMA)
beads available from Bangs Laboratories (Carmel, IN). Such microparticles are
useful in a variety of sizes and derivatized with a variety of linkage groups for
synthesizing tags or tag complements. Preferably, for massively parallel
manipulations of tagged microparticles, 5 µm diameter GMA beads are employed.

Attaching Tags to Polynucleotides
For Sorting onto Solid Phase Supports

An important aspect of the invention is the sorting and attachment of a
population of polynucleotides, e.g. from a cDNA library, to microparticles or to
separate regions on a solid phase support such that each microparticle or region has
substantially only one kind of polynucleotide attached. This objective is
accomplished by ensuring that substantially all different polynucleotides have
different tags attached. This condition, in turn, is brought about by taking a sample of
the full ensemble of tag-polynucleotide conjugates for analysis. (It is acceptable that
identical polynucleotides have different tags, as it merely results in the same
polynucleotide being operated on or analyzed twice in two different locations.) Such
sampling can be carried out overtly, for example by taking a small volume
from a larger mixture after the tags have been attached to the polynucleotides; it can
be carried out inherently, as a secondary effect of the techniques used to process the
polynucleotides and tags; or it can be carried out both overtly and as an
inherent part of processing steps.

Preferably, in constructing a cDNA library where substantially all different
cDNAs have different tags, a tag repertoire is employed whose complexity, or number
of distinct tags, greatly exceeds the total number of mRNAs extracted from a cell or
tissue sample. Preferably, the complexity of the tag repertoire is at least 10 times that
of the polynucleotide population; and more preferably, the complexity of the tag
repertoire is at least 100 times that of the polynucleotide population. Below, a
protocol is disclosed for cDNA library construction using a primer mixture that
contains a full repertoire of exemplary 9-word tags. Such a mixture of tag-containing
primers has a complexity of 8^9, or about 1.34 x 10^8. As indicated by Winslow et al,
Nucleic Acids Research, 19: 3251-3253 (1991), mRNA for library construction can
be extracted from as few as 10-100 mammalian cells. Since a single mammalian cell
contains about 5 x 10^5 copies of mRNA molecules of about 3.4 x 10^4 different kinds,
by standard techniques one can isolate the mRNA from about 100 cells, or
(theoretically) about 5 x 10^7 mRNA molecules. Comparing this number to the
complexity of the primer mixture shows that without any additional steps, and even
assuming that mRNAs are converted into cDNAs with perfect efficiency (1%
efficiency or less is more accurate), the cDNA library construction protocol results in
a population containing no more than 37% of the total number of different tags. That
is, without any overt sampling step at all, the protocol inherently generates a sample
that comprises 37%, or less, of the tag repertoire. The probability of obtaining a
double under these conditions is about 5%, which is within the preferred range. With
mRNA from 10 cells, the fraction of the tag repertoire sampled is reduced to only
3.7%, even assuming that all the processing steps take place at 100% efficiency. In
fact, the efficiencies of the processing steps for constructing cDNA libraries are very 
low, a "rule of thumb" being that good library should contain about 10^ cDNA clones 
from mRNA extracted from 10^ mammalian cells. 
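The 37% and 5% figures quoted above can be reproduced with a few lines of arithmetic. The sketch below assumes about 5 x 10^5 mRNA copies per cell (the reading consistent with the 37% figure), perfect conversion efficiency, and that the "double" probability is the Poisson probability of exactly two conjugates sharing a given tag; that last interpretation is an assumption of this example, not a statement from the text.

```python
import math

# Complexity of a repertoire of 9-word tags built from 8 words.
tag_complexity = 8 ** 9                      # 134,217,728, about 1.34 x 10^8

# Assumed inputs: mRNA copies per mammalian cell and number of cells.
mrna_per_cell = 5e5
cells = 100
molecules = mrna_per_cell * cells            # about 5 x 10^7 mRNA molecules

# Upper bound on the fraction of the tag repertoire that is sampled.
fraction = molecules / tag_complexity
print(f"fraction of repertoire sampled: {fraction:.1%}")

# Poisson probability that exactly two molecules carry a given tag,
# using the sampled fraction as the mean number of molecules per tag.
p_double = math.exp(-fraction) * fraction ** 2 / math.factorial(2)
print(f"probability of a double: {p_double:.1%}")

# With mRNA from only 10 cells the sampled fraction drops tenfold.
print(f"fraction with 10 cells: {mrna_per_cell * 10 / tag_complexity:.1%}")
```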

Where larger amounts of mRNA are used in the above protocol, or larger amounts
of polynucleotides in general, such that the number of such molecules exceeds the
complexity of the tag repertoire, a tag-polynucleotide conjugate mixture potentially
contains every possible pairing of tags and types of mRNA or polynucleotide. In such
cases, overt sampling may be implemented by removing a sample volume after a
serial dilution of the starting mixture of tag-polynucleotide conjugates. The amount
of dilution required depends on the amount of starting material and the efficiencies of
the processing steps, which are readily estimated.

If mRNA were extracted from 10^6 cells (which would correspond to about 0.5
µg of poly(A)+ RNA), and if primers were present in about 10-100 fold concentration
excess, as is called for in a typical protocol, e.g. Sambrook et al, Molecular Cloning,
Second Edition, page 8.61 [10 µL 1.8 kb mRNA at 1 mg/mL equals about 1.68 x 10^-11
moles and 10 µL 18-mer primer at 1 mg/mL equals about 1.68 x 10^-9 moles], then the
total number of tag-polynucleotide conjugates in a cDNA library would simply be
equal to or less than the starting number of mRNAs, or about 5 x 10^11 vectors
containing tag-polynucleotide conjugates. Again, this assumes that each step in cDNA
construction (first strand synthesis, second strand synthesis, ligation into a vector)
occurs with perfect efficiency, which is a very conservative estimate. The actual
number is significantly less.

If a sample of n tag-polynucleotide conjugates is randomly drawn from a
reaction mixture, as could be effected by taking a sample volume, the probability of
drawing conjugates having the same tag is described by the Poisson distribution,
P(r) = e^(-λ) λ^r / r!, where r is the number of conjugates having the same tag and λ = np,
where p is the probability of a given tag being selected. If n = 10^6 and p = 1/(1.34 x
10^8), then λ = .00746 and P(2) = 2.76 x 10^-5. Thus, a sample of one million molecules
gives rise to an expected number of doubles well within the preferred range. Such a
sample is readily obtained as follows: Assume that the 5 x 10^11 mRNAs are perfectly
converted into 5 x 10^11 vectors with tag-cDNA conjugates as inserts and that the 5 x
10^11 vectors are in a reaction solution having a volume of 100 µl. Four 10-fold serial
dilutions may be carried out by transferring 10 µl from the original solution into a
vessel containing 90 µl of an appropriate buffer, such as TE. This process may be
repeated for three additional dilutions to obtain a 100 µl solution containing 5 x 10^5
vector molecules per µl. A 2 µl aliquot from this solution yields 10^6 vectors
containing tag-cDNA conjugates as inserts. This sample is then amplified by
straightforward transformation of a competent host cell followed by culturing.
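The Poisson estimate and the dilution scheme described above can be checked numerically. The following sketch takes n = 10^6, a tag repertoire of 1.34 x 10^8, and 5 x 10^11 vectors in 100 µl as its inputs, and it assumes perfect efficiency at every step; the function name `poisson` and the variable names are illustrative.

```python
import math


def poisson(r: int, lam: float) -> float:
    """Poisson probability P(r) = e^(-lam) * lam^r / r!."""
    return math.exp(-lam) * lam ** r / math.factorial(r)


# Sample of one million conjugates drawn against a repertoire of 1.34 x 10^8 tags.
n = 10 ** 6
p = 1 / 1.34e8
lam = n * p
p2 = poisson(2, lam)
print(f"lambda = {lam:.5f}, P(2) = {p2:.3g}")

# Serial dilution: 5 x 10^11 vectors in 100 ul, then four 10-fold dilutions.
vectors = 5e11
volume_ul = 100
concentration = vectors / volume_ul          # vectors per ul before dilution
for _ in range(4):
    concentration /= 10                      # 10 ul carried into 90 ul of buffer
aliquot_ul = 2
sample = concentration * aliquot_ul
print(f"{concentration:.0e} vectors per ul; {aliquot_ul} ul aliquot = {sample:.0e} vectors")
```

The computed values agree with the figures in the text: λ of about 0.00746, P(2) of about 2.76 x 10^-5, and a 2 µl aliquot containing about 10^6 vectors.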

Of course, as mentioned above, no step in the above process proceeds with
perfect efficiency. In particular, when vectors are employed to amplify a sample of
tag-polynucleotide conjugates, the step of transforming a host is very inefficient.
Usually, no more than 1% of the vectors are taken up by the host and replicated.
Thus, for such a method of amplification, even fewer dilutions would be required to
obtain a sample of 10^6 conjugates.

A repertoire of oligonucleotide tags can be conjugated to a population of
polynucleotides in a number of ways, including direct enzymatic ligation,
amplification, e.g. via PCR, using primers containing the tag sequences, and the like.
The initial ligating step produces a very large population of tag-polynucleotide
conjugates such that a single tag is generally attached to many different
polynucleotides. However, as noted above, by taking a sufficiently small sample of
the conjugates, the probability of obtaining "doubles," i.e. the same tag on two
different polynucleotides, can be made negligible. Generally, the larger the sample
the greater the probability of obtaining a double. Thus, a design trade-off exists
between selecting a large sample of tag-polynucleotide conjugates (which, for
example, ensures adequate coverage of a polynucleotide in a shotgun sequencing
operation or adequate representation of a rapidly changing mRNA pool) and selecting
a small sample which ensures that a minimal number of doubles will be present. In
most embodiments, the presence of doubles merely adds an additional source of noise
or, in the case of sequencing, a minor complication in scanning and signal processing,
as microparticles giving multiple fluorescent signals can simply be ignored.
As used herein, the term "substantially all" in reference to attaching tags to
molecules, especially polynucleotides, is meant to reflect the statistical nature of the
sampling procedure employed to obtain a population of tag-molecule conjugates
essentially free of doubles. The meaning of substantially all in terms of actual
percentages of tag-molecule conjugates depends on how the tags are being employed.
Preferably, for nucleic acid sequencing, substantially all means that at least eighty
percent of the polynucleotides have unique tags attached. More preferably, it means
that at least ninety percent of the polynucleotides have unique tags attached. Still
more preferably, it means that at least ninety-five percent of the polynucleotides have
unique tags attached. And, most preferably, it means that at least ninety-nine percent
of the polynucleotides have unique tags attached.

Preferably, when the population of polynucleotides consists of messenger
RNA (mRNA), oligonucleotide tags may be attached by reverse transcribing the
mRNA with a set of primers preferably containing complements of tag sequences.
An exemplary set of such primers could have the following sequence (SEQ ID NO:
9):

5'-mRNA-[A]n-3'
        [T]19GG[W,W,W,C]9ACCAGCTGATC-5'-biotin

where "[W,W,W,C]9" represents the sequence of an oligonucleotide tag of nine
subunits of four nucleotides each and "[W,W,W,C]" represents the subunit sequences
listed above, i.e. "W" represents T or A. The underlined sequences identify an
optional restriction endonuclease site that can be used to release the polynucleotide
from attachment to a solid phase support via the biotin, if one is employed. For the
above primer, the complement attached to a microparticle could have the form:

5'-[G,W,W,W]9TGG-linker-microparticle

After reverse transcription, the mRNA is removed, e.g. by RNase H digestion,
and the second strand of the cDNA is synthesized using, for example, a primer of the
following form (SEQ ID NO: 10):

5'-NRRGATCYNNN-3'

where N is any one of A, T, G, or C; R is a purine-containing nucleotide, and Y is a
pyrimidine-containing nucleotide. This particular primer creates a Bst YI restriction
site in the resulting double stranded DNA which, together with the Sal I site,
facilitates cloning into a vector with, for example, Bam HI and Xho I sites. After Bst
YI and Sal I digestion, the exemplary conjugate would have the form:

5'-RCGACCA[C,W,W,W]9GG[T]19-cDNA-NNNR
       GGT[G,W,W,W]9CC[A]19-rDNA-NNNYCTAG-5'

The polynucleotide-tag conjugates may then be manipulated using standard molecular
biology techniques. For example, the above conjugate, which is actually a mixture,
may be inserted into commercially available cloning vectors, e.g. Stratagene Cloning
System (La Jolla, CA), and transfected into a host, such as a commercially available
host bacterium, which is then cultured to increase the number of conjugates. The
cloning vectors may then be isolated using standard techniques, e.g. Sambrook et al,
Molecular Cloning, Second Edition (Cold Spring Harbor Laboratory, New York,
1989). Alternatively, appropriate adaptors and primers may be employed so that the
conjugate population can be increased by PCR.

Preferably, when the ligase-based method of sequencing is employed, the Bst
YI and Sal I digested fragments are cloned into a Bam HI-/Xho I-digested vector
having the following single-copy restriction sites (SEQ ID NO: 11):

5'-GAGGATGCCTTTATGGATCCACTCGAGATCCCAATCCA-3'
     Fok I       Bam HI Xho I

This adds the Fok I site which will allow initiation of the sequencing process
discussed more fully below.
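The single-copy claim for the three sites can be checked directly on the sequence. The following Python is a minimal sketch rather than a cloning tool: the recognition sequences are the standard ones for Fok I (GGATG), Bam HI (GGATCC) and Xho I (CTCGAG), the sequence string is assumed to be the OCR-recovered SEQ ID NO: 11 with spaces removed, and the helper names are hypothetical.

```python
# Recognition sequences (5'->3') for the enzymes named in the text.
SITES = {"Fok I": "GGATG", "Bam HI": "GGATCC", "Xho I": "CTCGAG"}

# Top strand of the vector segment as recovered from the text (assumption).
VECTOR_SEGMENT = "GAGGATGCCTTTATGGATCCACTCGAGATCCCAATCCA"

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def find_sites(seq, site):
    """Return sorted 0-based top-strand positions where `site` occurs on either strand."""
    positions = set()
    for pattern in (site, reverse_complement(site)):
        start = seq.find(pattern)
        while start != -1:
            positions.add(start)
            start = seq.find(pattern, start + 1)
    return sorted(positions)


site_map = {name: find_sites(VECTOR_SEGMENT, site) for name, site in SITES.items()}
for name, positions in site_map.items():
    print(name, positions)
```

Each enzyme is found once in the segment under these assumptions. A real vector would need the same check over its full sequence, not just this segment.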

Tags can be conjugated to cDNAs of existing libraries by standard cloning
methods. cDNAs are excised from their existing vector, isolated, and then ligated into
a vector containing a repertoire of tags. Preferably, the tag-containing vector is
linearized by cleaving with two restriction enzymes so that the excised cDNAs can be
ligated in a predetermined orientation. The concentration of the linearized
tag-containing vector is in substantial excess over that of the cDNA inserts so that
ligation provides an inherent sampling of tags.

A general method for exposing the single stranded tag after amplification
involves digesting a polynucleotide-containing conjugate with the 3'→5' exonuclease
activity of T4 DNA polymerase, or a like enzyme. When used in the presence of a
single deoxynucleoside triphosphate, such a polymerase will cleave nucleotides from
3' recessed ends present on the non-template strand of a double stranded fragment
until a complement of the single deoxynucleoside triphosphate is reached on the
template strand. When such a nucleotide is reached the 3'→5' digestion effectively
ceases, as the polymerase's extension activity adds nucleotides at a higher rate than
the excision activity removes nucleotides. Consequently, single stranded tags
constructed with three nucleotides are readily prepared for loading onto solid phase
supports.

The technique may also be used to preferentially methylate interior Fok I sites
of a polynucleotide while leaving a single Fok I site at the terminus of the
polynucleotide unmethylated. First, the terminal Fok I site is rendered single stranded
using a polymerase with deoxycytidine triphosphate. The double stranded portion of
the fragment is then methylated, after which the single stranded terminus is filled in
with a DNA polymerase in the presence of all four nucleoside triphosphates, thereby
regenerating the Fok I site. Clearly, this procedure can be generalized to
endonucleases other than Fok I.

After the oligonucleotide tags are prepared for specific hybridization, e.g. by
rendering them single stranded as described above, the polynucleotides are mixed
with microparticles containing the complementary sequences of the tags under
conditions that favor the formation of perfectly matched duplexes between the tags
and their complements. There is extensive guidance in the literature for creating these
conditions. Exemplary references providing such guidance include Wetmur, Critical
Reviews in Biochemistry and Molecular Biology, 26: 227-259 (1991); Sambrook et
al, Molecular Cloning: A Laboratory Manual, 2nd Edition (Cold Spring Harbor
Laboratory, New York, 1989); and the like. Preferably, the hybridization conditions
are sufficiently stringent so that only perfectly matched sequences form stable
duplexes. Under such conditions the polynucleotides specifically hybridized through
their tags may be ligated to the complementary sequences attached to the
microparticles. Finally, the microparticles are washed to remove polynucleotides with
unligated and/or mismatched tags.

When CPG microparticles conventionally employed as synthesis supports are
used, the density of tag complements on the microparticle surface is typically greater
than that necessary for some sequencing operations. That is, in sequencing
approaches that require successive treatment of the attached polynucleotides with a
variety of enzymes, densely spaced polynucleotides may tend to inhibit access of the
relatively bulky enzymes to the polynucleotides. In such cases, the polynucleotides
are preferably mixed with the microparticles so that tag complements are present in
significant excess, e.g. from 10:1 to 100:1, or greater, over the polynucleotides. This
ensures that the density of polynucleotides on the microparticle surface will not be so
high as to inhibit enzyme access. Preferably, the average inter-polynucleotide spacing
on the microparticle surface is on the order of 30-100 nm. Guidance in selecting
ratios for standard CPG supports and Ballotini beads (a type of solid glass support) is
found in Maskos and Southern, Nucleic Acids Research, 20: 1679-1684 (1992).
Preferably, for sequencing applications, standard CPG beads of diameter in the range
of 20-50 µm are loaded with about 10^5 polynucleotides, and GMA beads of diameter
in the range of 5-10 µm are loaded with a few tens of thousands of polynucleotides,
e.g. 4 x 10^4 to 6 x 10^4.

In the preferred embodiment, tag complements are synthesized on
microparticles combinatorially; thus, at the end of the synthesis, one obtains a
complex mixture of microparticles from which a sample is taken for loading tagged
polynucleotides. The size of the sample of microparticles will depend on several
factors, including the size of the repertoire of tag complements, the nature of the
apparatus used for observing loaded microparticles (e.g. its capacity), the tolerance
for multiple copies of microparticles with the same tag complement (i.e. "bead
doubles"), and the like. The following table provides guidance regarding
microparticle sample size, microparticle diameter, and the approximate physical
dimensions of a packed array of microparticles of various diameters.

Microparticle diameter         5 µm           10 µm        20 µm          40 µm

Max. no. polynucleotides
loaded at 1 per 10^5 sq.
angstrom                                      3 x 10^5     1.26 x 10^6    5 x 10^6

Approx. area of monolayer
of 10^6 microparticles         .45 x .45 cm   1 x 1 cm     2 x 2 cm       4 x 4 cm
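The entries in the table above follow from simple geometry, which the sketch below reproduces. It is an illustration under stated assumptions: each microparticle is treated as a smooth sphere, loading is taken as one polynucleotide per 10^5 square angstroms of surface, and the monolayer is assumed to be a square-packed array. The function names are hypothetical.

```python
import math

ANGSTROM2_PER_UM2 = 1e8  # 1 um = 10^4 angstrom, so 1 um^2 = 10^8 angstrom^2


def max_loading(diameter_um, area_per_polynucleotide_a2=1e5):
    """Polynucleotides on a sphere at one per `area_per_polynucleotide_a2`."""
    surface_um2 = math.pi * diameter_um ** 2  # sphere surface area = pi * d^2
    return surface_um2 * ANGSTROM2_PER_UM2 / area_per_polynucleotide_a2


def monolayer_side_cm(diameter_um, n_particles=10 ** 6):
    """Side of a square-packed monolayer of n_particles, in cm (assumption)."""
    per_side = math.sqrt(n_particles)
    return per_side * diameter_um * 1e-4  # 1 um = 1e-4 cm


for d in (5, 10, 20, 40):
    print(f"{d:>2} um: loading ~{max_loading(d):.3g}, "
          f"monolayer side ~{monolayer_side_cm(d):.2f} cm")
```

Square packing gives 0.5 cm for 5 µm particles, slightly above the tabulated .45 cm, which presumably reflects a denser packing arrangement.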

The probability that the sample of microparticles contains a given tag complement,
or that a given tag complement is present in multiple copies, is described by the
Poisson distribution, as indicated in the following table.
Table VII

Number of              Fraction of           Fraction of            Fraction of
microparticles in      repertoire of tag     microparticles in      microparticles in
sample (as fraction    complements           sample with unique     sample carrying same
of repertoire size),   present in sample,    tag complement         tag complement as one
m                      $1 - e^{-m}$          attached,              other microparticle in
                                             $m e^{-m}$             sample ("bead doubles"),
                                                                    $m^{2} e^{-m}/2$

1.000                  0.63                  0.37                   0.18
 .693                  0.50                  0.35                   0.12
 .405                  0.33                  0.27                   0.05
 .285                  0.25                  0.21                   0.03
 .223                  0.20                  0.18                   0.02
 .105                  0.10                  0.09                   0.005
 .010                  0.01                  0.01
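The columns of Table VII can be regenerated from the Poisson expressions in its headings. The sketch below is illustrative: the function name `table_vii_row` is hypothetical, and the middle expression is taken as m·e^(-m) because that is what the tabulated values match.

```python
import math


def table_vii_row(m):
    """Return (fraction of repertoire present, unique fraction, bead-double fraction)."""
    present = 1.0 - math.exp(-m)           # at least one copy of a given tag complement
    unique = m * math.exp(-m)              # exactly one copy
    doubles = m ** 2 * math.exp(-m) / 2.0  # exactly two copies
    return present, unique, doubles


for m in (1.000, 0.693, 0.405, 0.285, 0.223, 0.105, 0.010):
    present, unique, doubles = table_vii_row(m)
    print(f"{m:5.3f}  {present:4.2f}  {unique:4.2f}  {doubles:.3g}")
```

Smaller samples relative to the repertoire reduce bead doubles, at the cost of representing a smaller fraction of the tag complements.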



Apparatus for Observing Detection Signals 
at Spatially Addressable Sites 



Preferably, a spatially addressable array is established by fixing microparticles
containing tag complements to a solid phase surface.

Preferably, whenever light-generating signals, e.g. chemiluminescent,
fluorescent, or the like, are employed to detect tags, microparticles are spread on a
planar substrate, e.g. a glass slide, for examination with a scanning system, such as
described in International patent applications PCT/US91/09217, PCT/NL90/00081,
and PCT/US95/01886. The scanning system should be able to reproducibly scan the
substrate and to define the position of each microparticle in a predetermined region
by way of a coordinate system. In polynucleotide sequencing applications, it is
important that the positional identification of microparticles be repeatable in
successive scan steps.

Such scanning systems may be constructed from commercially available
components, e.g. an x-y translation table controlled by a digital computer used with a
detection system comprising one or more photomultiplier tubes, or alternatively, a
CCD array, and appropriate optics, e.g. for exciting, collecting, and sorting
fluorescent signals. In some embodiments a confocal optical system may be
desirable. An exemplary scanning system suitable for use in four-color sequencing is
illustrated diagrammatically in Figure 3. Substrate 300, e.g. a microscope slide with
fixed microparticles, is placed on x-y translation table 302, which is connected to and
controlled by an appropriately programmed digital computer 304, which may be any of
a variety of commercially available personal computers, e.g. 486-based machines or
PowerPC model 7100 or 8100 available from Apple Computer (Cupertino, CA).
Computer software for table translation and data collection functions can be provided
by commercially available laboratory software, such as LabWindows, available from
National Instruments.

Substrate 300 and table 302 are operationally associated with microscope 306
having one or more objective lenses 308 which are capable of collecting and
delivering light to microparticles fixed to substrate 300. Excitation beam 310 from
light source 312, which is preferably a laser, is directed to beam splitter 314, e.g. a
dichroic mirror, which re-directs the beam through microscope 306 and objective lens
308 which, in turn, focuses the beam onto substrate 300. Lens 308 collects
fluorescence 316 emitted from the microparticles and directs it through beam splitter
314 to signal distribution optics 318 which, in turn, directs fluorescence to one or
more suitable opto-electronic devices for converting some fluorescence characteristic,
e.g. intensity, lifetime, or the like, to an electrical signal. Signal distribution optics
318 may comprise a variety of components standard in the art, such as bandpass
filters, fiber optics, rotating mirrors, fixed position mirrors and lenses, diffraction
gratings, and the like. As illustrated in Figure 5, signal distribution optics 318 directs
fluorescence 316 to four separate photomultiplier tubes, 330, 332, 334, and 336,
whose output is then directed to pre-amps and photon counters 350, 352, 354, and
356. The output of the photon counters is collected by computer 304, where it can be
stored, analyzed, and viewed on video 360. Alternatively, signal distribution optics
318 could be a diffraction grating which directs fluorescent signal 316 onto a CCD
array.

The stability and reproducibility of the positional localization in scanning will
determine, to a large extent, the resolution for separating closely spaced
microparticles. Preferably, the scanning systems should be capable of resolving
closely spaced microparticles, e.g. separated by a particle diameter or less. Thus, for
most applications, e.g. using CPG microparticles, the scanning system should at least
have the capability of resolving objects on the order of 5-100 µm. Even higher
resolution may be desirable in some embodiments, but with increased resolution, the
time required to fully scan a substrate will increase; thus, in some embodiments a
compromise may have to be made between speed and resolution. Scanning time can be
reduced by a system which only scans positions where microparticles are known to be
located, e.g. from an initial full scan. Preferably, microparticle size and scanning
system resolution are selected to permit resolution of fluorescently labeled
microparticles randomly disposed on a plane at a density between about ten thousand
and one hundred thousand microparticles per cm².
In sequencing applications, microparticles can be fixed to the surface of a
substrate in a variety of ways. The fixation should be strong enough to allow the
microparticles to undergo successive cycles of reagent exposure and washing without
significant loss. When the substrate is glass, its surface may be derivatized with an
alkylamino linker using commercially available reagents, e.g. Pierce Chemical, which
in turn may be cross-linked to avidin, again using conventional chemistries, to form
an avidinated surface. Biotin moieties can be introduced to the microparticles in a
number of ways.

In an alternative, when DNA-loaded microparticles are applied to a glass
substrate, the DNA nonspecifically adsorbs to the glass surface upon several hours,
e.g. 24 hours, incubation to create a bond sufficiently strong to permit repeated
exposures to reagents and washes without significant loss of microparticles. Such a
glass substrate may be a flow cell, which may comprise a channel etched in a glass
slide. Preferably, such a channel is closed so that fluids may be pumped through it and
has a depth sufficiently close to the diameter of the microparticles so that a monolayer
of microparticles is trapped within a defined observation region.

Kits for Implementing the Method of the Invention

The invention includes kits for carrying out the various embodiments of the
invention. Preferably, kits of the invention include a repertoire of tag complements
attached to a solid phase support. Additionally, kits of the invention may include the
corresponding repertoire of tags, e.g. as primers for amplifying the polynucleotides to
be sorted or as elements of cloning vectors. Preferably, the repertoire of tag
complements is attached to microparticles. Kits may also contain appropriate
buffers for enzymatic processing, detection chemistries, e.g. fluorescent or
chemiluminescent components for labelling tags, and the like, instructions for use,
and processing enzymes, such as ligases, polymerases, transferases, and so on. In an
important embodiment for sequencing, kits may also include substrates, such as
avidinated microscope slides or microtiter plates, for fixing microparticles for
processing.

Example 1 

Construction of a Tag Library
An exemplary tag library is constructed as follows to form the chemically
synthesized 9-word tags of nucleotides A, G, and T defined by the formula:

3'-TGGC-[⁴(A,G,T)₉]-CCCCp

where "[⁴(A,G,T)₉]" indicates a tag mixture where each tag consists of nine 4-mer
words of A, G, and T; and "p" indicates a 5' phosphate. This mixture is ligated to the
following right and left primer binding regions (SEQ ID NO: 12 & 13):

5'-AGTGGCTGGGCATCGGACCG          5'-GGGGCCCAGTCAGCGTCGAT
   TCACCGACCCGTAGCCp                     GGGTCAGTCGCAGCTA

         LEFT                                RIGHT

The right and left primer binding regions are ligated to the above tag mixture, after
which the single stranded portion of the ligated structure is filled with DNA
polymerase then mixed with the right and left primers indicated below and amplified
to give a tag library.

Left primer:

5'-AGTGGCTGGGCATCGGACCG

5'-AGTGGCTGGGCATCGGACCG-[⁴(A,G,T)₉]-GGGGCCCAGTCAGCGTCGAT
   TCACCGACCCGTAGCCTGGC-[⁴(A,G,T)₉]-CCCCGGGTCAGTCGCAGCTA

                                    CCCCGGGTCAGTCGCAGCTA-5'

                                    Right primer

The underlined portion of the left primer binding region indicates a Rsr II recognition
site. The left-most underlined region of the right primer binding region indicates
recognition sites for Bsp 120I, Apa I, and Eco O109I, and a cleavage site for Hga I.
The right-most underlined region of the right primer binding region indicates the
recognition site for Hga I. Optionally, the right or left primers may be synthesized
with a biotin attached (using conventional reagents, e.g. available from Clontech
Laboratories, Palo Alto, CA) to facilitate purification after amplification and/or
cleavage.

Example 2

Construction of a Plasmid Library of Tag-Polynucleotide
Conjugates for cDNA "Signature" Sequencing
cDNA is produced from an mRNA sample by conventional protocols using
pGGCCCT15(A or G or C) as a primer for first strand synthesis anchored at the
boundary of the poly A region of the mRNAs and N8(A or T)GATC as the primer for
second strand synthesis. That is, both are degenerate primers such that the second
strand primer is present in two forms and the first strand primer is present in three
forms. The GATC sequence in the second strand primer corresponds to the
recognition site of Mbo I; other four base recognition sites could be used as well, such
as those for Bam HI, Sph I, Eco RI, or the like. The presence of the A and T adjacent
to the restriction site of the second strand primer ensures that a stripping and exchange
reaction can be used in the next step to generate a five-base 5' overhang of "GGCCC".
The first strand primer is annealed to the mRNA sample and extended with reverse
transcriptase, after which the RNA strand is degraded by the RNase H activity of the
reverse transcriptase leaving a single stranded cDNA. The second strand primer is
annealed and extended with a DNA polymerase using conventional protocols. After
second strand synthesis, the resulting cDNAs are methylated with CpG methylase
(New England Biolabs, Beverly, MA) using manufacturer's protocols. The 3' strands
of the cDNAs are then cut back with the above-mentioned stripping and exchange
reaction using T4 DNA polymerase in the presence of dATP and dTTP, after which
the cDNAs are ligated to the tag library of Example 1 previously cleaved with Hga I
to give the following construct:



5'-biotin-[primer binding site]-[TAG]-[Apa I]-[cDNA]
                  ↑                               ↑
             Rsr II site                      Mbo I site

Separately, the following cloning vector is constructed, e.g. starting from a
commercially available plasmid, such as a Bluescript phagemid (Stratagene, La Jolla,
CA) (SEQ ID NO: 14).

               T primer binding site      Ppu MI site
                       ↓                       ↓
(plasmid)-5'-AAAAGGAGGAGGCCTTGATAGAGAGGACCT GTTTAAAC-
            -TTTTCCTCCTCCGGAACTATCTCTCCTGGA CAAATTTG-

                       S primer binding site
                                ↓
-GTTTAAAC-GGATCC GCTGCTTTTTCCTCCTCC-3'-(plasmid)
-CAAATTTG-CCTAGG CGACGAAAAAGGAGGAGG-
     ↑       ↑     ↑
     |   Bam HI site
 Pme I site      Bbv I site

The plasmid is cleaved with Ppu MI and Pme I and then methylated with DAM
methylase. The tag-containing construct is cleaved with Rsr II and then ligated to the
open plasmid, after which the conjugate is cleaved with Mbo I and Bam HI to permit
ligation and closing of the plasmid. The plasmid is then amplified and isolated for
use as a template in the selective amplifications using the S and T primers.

Example 3

Signature Sequencing of a cDNA Library Using S and T Primers
The plasmid constructed in Example 2 is used as a base library for generating
amplicons with S and T primers. The following T primer is employed (SEQ ID NO:
15):

biotin-5'-IIIIIIIIAAAAGGAGGAGGCCTTGA

where the I's are deoxyinosines added to balance the annealing and melting
temperatures of the S and T primers. The following 32 S primers are employed (SEQ
ID NO: 16-23):

XCGACGAAAAAGGAGGAGGIIIIIII-5'
XNCGACGAAAAAGGAGGAGGIIIIII
XNBCGACGAAAAAGGAGGAGGIIIII
XNBBCGACGAAAAAGGAGGAGGIIII
XNBBBCGACGAAAAAGGAGGAGGIII
XNBBBBCGACGAAAAAGGAGGAGGII
XNBBBBBCGACGAAAAAGGAGGAGGI
XNBBBBBBCGACGAAAAAGGAGGAGG

Bbv I site

where X represents one of the four nucleotides A, C, G, and T, so that each of the
above sequences represents four different mixtures of S primers, and where N
represents a mixture of the four natural nucleotides and B represents a mixture of I
and C, which serve as degeneracy-reducing analogs. Clearly, other spacer nucleotides
may also be used.
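The pattern in the S primer list is regular enough to generate programmatically. The following Python is a sketch under stated assumptions: the constant `ANCHOR` is the common primer binding sequence read off the listing (written 3'→5' as in the text), the spacer is taken to be one N followed by B's, and the function names are hypothetical. The symbols X, N, B and I are kept as placeholders, not expanded into real bases apart from X.

```python
# Common segment of every S primer, written 3'->5' as in the listing (assumption).
ANCHOR = "CGACGAAAAAGGAGGAGG"


def s_primer_templates(n_positions=8):
    """Build the templates, one per interrogated nucleotide position."""
    templates = []
    for k in range(n_positions):
        # k spacer symbols: none for the first template, then N followed by B's.
        spacer = "" if k == 0 else "N" + "B" * (k - 1)
        # Deoxyinosine tail keeps every primer the same overall length.
        tail = "I" * (n_positions - 1 - k)
        templates.append("X" + spacer + ANCHOR + tail)
    return templates


def expand_terminal_base(templates, bases="ACGT"):
    """Replace the 3'-terminal X with each base, one primer mixture per base."""
    return [t.replace("X", base, 1) for t in templates for base in bases]


templates = s_primer_templates()
s_primers = expand_terminal_base(templates)
print(len(templates), "templates ->", len(s_primers), "S primer mixtures")
```

Eight templates with four terminal bases each give the 32 S primer mixtures referred to in the text.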

The plasmid DNA is methylated with Hae III methylase and dispersed into 32
separate vessels, e.g. wells of a microtiter plate, where PCRs using the above primers
are performed. Preferably, the reactions are arranged in 8 rows of 4 reactions: one
row for each nucleotide position being interrogated and one column for each 3'
terminal nucleotide of the S primer. 8 rows are required because the reach of Bbv I is
(8/12), so that after 8 nucleotides are identified by the above PCRs, the nucleotides
are removed from the cDNA by Bbv I cleavage.

After PCR amplification, the amplicons from each reaction are separately
captured on magnetic beads carrying a single stranded sequence that forms a triplex
with the S primers. The beads are then transferred to reaction mixtures containing
Apa I, which cleaves all strands not containing methyl groups, i.e. all the strands that
have been selectively amplified. The released strands are next captured via their
biotinylated T primers with magnetic beads coated with avidin and transferred to
reaction vessels where their 3' ends are stripped in the presence of T4 DNA
polymerase and dGTP, as shown below:

After cleavage with Apa I (SEQ ID NO: 24):

biotin-5'-IIIIIIII[AG]12TAGAGAGGACCG[TAGS]GGGGCC
          CCCCCCCC[TC]12ATCTCTCCTGGC[SGAT]CC

                    ↓  T4 polymerase + dGTP

biotin-5'-IIIIIIII[AG]12TAGAGAGGACCG[TAGS]GG
                                 GGC[SGAT]CC

                    ↓  Add dUTP*, dCTP & ddATP

biotin-5'-IIIIIIII[AG]12TAGAGAGGACCG[TAGS]GG
                      dAUCUCUCCUGGC[SGAT]CC
                        * * *  *

                    ↓  Heat denature

dAUCUCUCCUGGC[SGAT]CC-5' (SEQ ID NO: 25)
  * * *  *

Here dUTP* represents a labeled dUTP and ddATP represents dideoxyadenosine
triphosphate. Preferably, dUTP is labeled with a separate spectrally resolvable
fluorescent dye for each of the four column PCRs. The released tags for each of the
eight row PCRs are mixed and are applied to the spatially addressable array for
hybridization to their complements and detection.

After, or concurrently with, the 32 PCRs, Bbv I is used to shorten the cDNA
inserts of the library. An amplicon is produced from a fresh sample of plasmid using
the following S and T primers (SEQ ID NO: 26 & 27):

T primer: 5'-AAAAGGAGGAGGCCTTGA

S primer: 3'-CGACGAAAAAGGAGGAGG-biotin

After amplification, the amplicon is methylated to protect internal Bbv I sites, its 3'
ends are stripped using T4 DNA polymerase and dGTP, after which the recessed
strands are filled in by the addition of dTTP and dCTP. The amplicon is then cleaved
with Bbv I and the S primer segment is removed with magnetic beads coated with
avidin. The following adaptor mixture containing a new S primer binding site is then
ligated to the T primer segment:

5'-GGAGGAGGAAAAAGCAGC
   CCTCCTCCTTTTTCGTCGNNNN

The ligated fragment is then amplified for the next cycle of selective amplifications
with the 32 S primers described above.


Example 4 

Signature Sequencing of a cDNA Library Using Rolling Primers 
A plasmid library of tag-polynucleotide conjugates is constructed as described
in Example 2, except that the primer binding region is constructed as follows:

               T primer binding site      Ppu MI site
                       ↓                       ↓
(plasmid)-5'-AAAAGGAGGAGGCCTTGATAGAGAGGACCT GTTTAAAC-
            -TTTTCCTCCTCCGGAACTATCTCTCCTGGA CAAATTTG-

                    rolling primer binding site
                                ↓
-GTTTAAAC-GGATCC-TCTTCCTCTTCCTCTTCC-3'-(plasmid)
-CAAATTTG-CCTAGG-AGAAGGAGAAGGAGAAGG-
     ↑       ↑
     |   Bam HI site
 Pme I site

The rolling primer binding site corresponds to a rolling primer of subgroup (1),
described above. As with Example 2, the above plasmid is cleaved with Ppu MI and
Pme I (to give a Rsr II-compatible end and a flush end so that the insert is oriented)
and then methylated with DAM methylase. The tag-containing construct (described
in Example 2) is then cleaved with Rsr II and then ligated to the open plasmid, after
which the conjugate is cleaved with Mbo I and Bam HI to permit ligation and closing
of the plasmid. The plasmid is then amplified and isolated for use as a template for
extensions and amplifications with rolling primers.



Example 5

Signature Sequencing of a cDNA Library with Rolling Primers
The plasmid constructed in Example 2 is used for generating extension
products and amplicons with the rolling primers described above and the following T
primer (SEQ ID NO: 15):

biotin-5'-IIIIIIIIAAAAGGAGGAGGCCTTGA
where the I's are deoxyinosines added to balance the annealing and melting
temperatures of the T primers and rolling primers. Preferably, the annealing
temperature is about 55°C. Clearly, many other sequences could be employed in the
implementation of the invention. The rolling primers described above are employed.
The segment containing the T primer binding site through the rolling primer
binding site is excised and separated from the plasmid of Example 2. (This can be
accomplished in a variety of ways known to those skilled in the art, for example,
engineering the plasmid to contain restriction sites flanking the segment, or by simply
amplifying directly by PCR.) After replacing deoxyguanosines with deoxyinosines,
e.g. by PCR in the presence of dITP, the segment is aliquoted into four vessels,
denatured, and the appropriate rolling primer is added. Conditions are adjusted to
permit the rolling primers to anneal, after which the primers are extended with
Sequenase, or like high fidelity polymerase, in the presence of dATP, dCTP, dITP,
and dTTP, using the manufacturer's protocol. The remaining single stranded DNA is
digested with a single stranded nuclease, such as Mung bean nuclease. Optionally, the
double stranded DNA extension product may be separated from the reaction mixture,
e.g. by capture via the formation of a triplex between, for example, the T primer
binding region, and an appropriate single stranded complement attached to a magnetic
bead.

The double stranded DNA is combined with T primer (and rolling primer if a
separation step was used) and amplified by 5-10 cycles of PCR in the presence of
dATP, dCTP, dITP, and dTTP to form the four initial amplicons. Samples of these
are combined and re-distributed into vessels with the appropriate rolling primers for
the next cycle of extension. Samples are also drawn off for analysis as described in
Example 3.
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APPENDIX la 

Exemplary computer program for generating 
minimally cross hybridizing sets 
(single stranded tag/single stranded tag complement) 



      Program minxh
c
c
c
      integer*2 sub1(6),mset1(1000,6),mset2(1000,6)
      dimension nbase(6)
c
c
      write(*,*)'ENTER SUBUNIT LENGTH'
      read(*,100)nsub
100   format(i1)
      open(1,file='sub4.dat',form='formatted',status='new')
c
c
      nset=0
      do 7000 m1=1,3
      do 7000 m2=1,3
      do 7000 m3=1,3
      do 7000 m4=1,3
      sub1(1)=m1
      sub1(2)=m2
      sub1(3)=m3
      sub1(4)=m4
c
      ndiff=3
c
c     Generate set of subunits differing from
c     sub1 by at least ndiff nucleotides.
c     Save in mset1.
c
      jj=1
      do 900 j=1,nsub
900   mset1(1,j)=sub1(j)
c
      do 1000 k1=1,3
      do 1000 k2=1,3
      do 1000 k3=1,3
      do 1000 k4=1,3
c
      nbase(1)=k1
      nbase(2)=k2
      nbase(3)=k3
      nbase(4)=k4
c
      n=0
      do 1200 j=1,nsub
      if(sub1(j).eq.1 .and. nbase(j).ne.1 .or.
     1   sub1(j).eq.2 .and. nbase(j).ne.2 .or.
     3   sub1(j).eq.3 .and. nbase(j).ne.3) then
         n=n+1
      endif
1200  continue
c
      if(n.ge.ndiff) then
c
c     If number of mismatches
c     is greater than or equal
c     to ndiff then record
c     subunit in matrix mset
c
         jj=jj+1
         do 1100 i=1,nsub
1100     mset1(jj,i)=nbase(i)
      endif
c
c
1000  continue
c
c
      do 1325 j2=1,nsub
      mset2(1,j2)=mset1(1,j2)
1325  mset2(2,j2)=mset1(2,j2)
c
c     Compare subunit 2 from
c     mset1 with each successive
c     subunit in mset1, i.e. 3,
c     4,5, ... etc. Save those
c     with mismatches .ge. ndiff
c     in matrix mset2 starting at
c     position 2.
c     Next transfer contents
c     of mset2 into mset1 and
c     start
c     comparisons again this time
c     starting with subunit 3.
c     Continue until all subunits
c     undergo the comparisons.
c
      npass=0
c
c
1700  continue
      kk=npass+2
      npass=npass+1
c
c
      do 1500 m=npass+2,jj
      n=0
      do 1600 j=1,nsub
      if(mset1(npass+1,j).eq.1 .and. mset1(m,j).ne.1 .or.
     2   mset1(npass+1,j).eq.2 .and. mset1(m,j).ne.2 .or.
     2   mset1(npass+1,j).eq.3 .and. mset1(m,j).ne.3) then
         n=n+1
      endif
1600  continue
      if(n.ge.ndiff) then
         kk=kk+1
         do 1625 i=1,nsub
1625     mset2(kk,i)=mset1(m,i)
      endif
1500  continue
c
c     kk is the number of subunits
c     stored in mset2
c
c     Transfer contents of mset2
c     into mset1 for next pass.
c
      do 2000 k=1,kk
      do 2000 m=1,nsub
2000  mset1(k,m)=mset2(k,m)
      if(kk.lt.jj) then
         jj=kk
         goto 1700
      endif
c
c
      nset=nset+1
      write(1,7009)
7009  format(/)
      do 7008 k=1,kk
7008  write(1,7010)(mset1(k,m),m=1,nsub)
7010  format(4i1)
      write(*,*)
      write(*,120) kk,nset
120   format(1x,'Subunits in set=',i5,2x,'Set No=',i5)
7000  continue
      close(1)
c
c
      end
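The sieve performed by minxh can be illustrated with a short sketch. The following is a minimal Python rendering written for illustration only, not a transcription of the Fortran listing: the function names `mismatches` and `greedy_set` are hypothetical, words are tuples over the integers 1 to 3 as in the program, and the pairwise thinning is written as a single greedy pass on the assumption that it keeps the same words as the repeated mset1/mset2 passes.

```python
from itertools import product


def mismatches(a, b):
    """Count the positions at which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))


def greedy_set(seed, alphabet=(1, 2, 3), ndiff=3):
    """Build a set of words that differ pairwise in at least ndiff positions.

    Illustrative only: stage 1 keeps the seed plus every word far enough
    from the seed; stage 2 walks that list and drops any word that is too
    close to a word already kept.
    """
    seed = tuple(seed)
    # Stage 1: candidates differing from the seed in >= ndiff positions.
    candidates = [seed] + [
        w for w in product(alphabet, repeat=len(seed))
        if mismatches(seed, w) >= ndiff
    ]
    # Stage 2: pairwise thinning in list order.
    kept = []
    for w in candidates:
        if all(mismatches(w, k) >= ndiff for k in kept):
            kept.append(w)
    return kept


if __name__ == "__main__":
    for word in greedy_set((1, 1, 1, 1)):
        print("".join(str(n) for n in word))
```

Calling `greedy_set` once for each of the 81 possible 4-mer seeds would correspond to the outer loops over m1 through m4 in the listing.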
APPENDIX Ib

Exemplary computer program for generating 
minimally cross hybridizing sets 
(single stranded tag/single stranded tag complement) 



      Program tagN
c
c     Program tagN generates minimally cross-hybridizing
c     sets of subunits given i) N--subunit length, and ii)
c     an initial subunit sequence. tagN assumes that only
c     3 of the four natural nucleotides are used in the tags.
c
      character*1 sub1(20)
      integer*2 mset(10000,20), nbase(20)
c
c
      write(*,*)'ENTER SUBUNIT LENGTH'
      read(*,100)nsub
100   format(i2)
c
c
      write(*,*)'ENTER SUBUNIT SEQUENCE'
      read(*,110)(sub1(k),k=1,nsub)
110   format(20a1)
c
c
      ndiff=10
c
c
c     Let a=1 c=2 g=3 & t=4
c
      do 800 kk=1,nsub
      if(sub1(kk).eq.'a') then
         mset(1,kk)=1
      endif
      if(sub1(kk).eq.'c') then
         mset(1,kk)=2
      endif
      if(sub1(kk).eq.'g') then
         mset(1,kk)=3
      endif
      if(sub1(kk).eq.'t') then
         mset(1,kk)=4
      endif
800   continue
c
c
c     Generate set of subunits differing from
c     sub1 by at least ndiff nucleotides.
c
c
      jj=1
c
c
      do 1000 k1=1,3
      do 1000 k2=1,3
      do 1000 k3=1,3
      do 1000 k4=1,3
      do 1000 k5=1,3
      do 1000 k6=1,3
      do 1000 k7=1,3
      do 1000 k8=1,3
      do 1000 k9=1,3
      do 1000 k10=1,3
      do 1000 k11=1,3
      do 1000 k12=1,3
      do 1000 k13=1,3
      do 1000 k14=1,3
      do 1000 k15=1,3
      do 1000 k16=1,3
      do 1000 k17=1,3
      do 1000 k18=1,3
      do 1000 k19=1,3
      do 1000 k20=1,3
c
      nbase(1)=k1
      nbase(2)=k2
      nbase(3)=k3
      nbase(4)=k4
      nbase(5)=k5
      nbase(6)=k6
      nbase(7)=k7
      nbase(8)=k8
      nbase(9)=k9
      nbase(10)=k10
      nbase(11)=k11
      nbase(12)=k12
      nbase(13)=k13
      nbase(14)=k14
      nbase(15)=k15
      nbase(16)=k16
      nbase(17)=k17
      nbase(18)=k18
      nbase(19)=k19
      nbase(20)=k20
c
      do 1250 nn=1,jj
c
      n=0
      do 1200 j=1,nsub
      if(mset(nn,j).eq.1 .and. nbase(j).ne.1 .or.
     1   mset(nn,j).eq.2 .and. nbase(j).ne.2 .or.
     2   mset(nn,j).eq.3 .and. nbase(j).ne.3 .or.
     3   mset(nn,j).eq.4 .and. nbase(j).ne.4) then
         n=n+1
      endif
1200  continue
c
c
      if(n.lt.ndiff) then
         goto 1000
      endif
1250  continue
c
c
      jj=jj+1
      write(*,130)(nbase(i),i=1,nsub),jj
      do 1100 i=1,nsub
      mset(jj,i)=nbase(i)
1100  continue
c
1000  continue
c
      write(*,*)
130   format(10x,20(1x,i1),5x,i5)
      write(*,*)
      write(*,120) jj
120   format(1x,'Number of words=',i5)
c
c
      end
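The single-pass selection used by tagN, in which each candidate word is compared against every word accepted so far, can be sketched as follows. This is an illustrative Python sketch under stated assumptions rather than a port of the listing: the names `CODE`, `LETTER`, and `tag_words` are hypothetical, the letter numbering follows the comment "Let a=1 c=2 g=3 & t=4", and candidates are enumerated only over the subunit length instead of over twenty nested loops.

```python
from itertools import product

# Letter numbering taken from the comment in the listing.
CODE = {"a": 1, "c": 2, "g": 3, "t": 4}
LETTER = {number: letter for letter, number in CODE.items()}


def tag_words(seed, ndiff, alphabet=(1, 2, 3)):
    """Grow a word set from `seed`, one candidate at a time.

    A candidate is accepted only if it differs from every word accepted
    so far in at least `ndiff` positions. Candidates use three of the four
    nucleotides (numbers 1 to 3), as the listing's header comment states.
    """
    words = [tuple(CODE[ch] for ch in seed.lower())]
    for cand in product(alphabet, repeat=len(words[0])):
        if all(sum(x != y for x, y in zip(cand, w)) >= ndiff for w in words):
            words.append(cand)
    return ["".join(LETTER[n] for n in w) for w in words]


if __name__ == "__main__":
    result = tag_words("gatt", ndiff=3)
    print(len(result), "words:", " ".join(result))
```

The listing fixes ndiff at 10, which only makes sense for subunits of at least ten nucleotides; the sketch exposes it as a parameter so that small cases can be inspected quickly.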






APPENDIX Ic 

Exemplary computer program for generating
minimally cross hybridizing sets 
(double stranded tag/single stranded tag complement) 

      Program 3tagN
c
c     Program 3tagN generates minimally cross-hybridizing
c     sets of duplex subunits given i) N--subunit length,
c     and ii) an initial homopurine sequence.
c
      character*1 sub1(20)
      integer*2 mset(10000,20), nbase(20)
c
      write(*,*)'ENTER SUBUNIT LENGTH'
      read(*,100)nsub
100   format(i2)
c
      write(*,*)'ENTER SUBUNIT SEQUENCE a & g only'
      read(*,110)(sub1(k),k=1,nsub)
110   format(20a1)
c
      ndiff=10
c
c     Let a=1 and g=2
c
      do 800 kk=1,nsub
      if(sub1(kk).eq.'a') then
         mset(1,kk)=1
      endif
      if(sub1(kk).eq.'g') then
         mset(1,kk)=2
      endif
800   continue
c
      jj=1
c
      do 1000 k1=1,3
      do 1000 k2=1,3
      do 1000 k3=1,3
      do 1000 k4=1,3
      do 1000 k5=1,3
      do 1000 k6=1,3
      do 1000 k7=1,3
      do 1000 k8=1,3
      do 1000 k9=1,3
      do 1000 k10=1,3
      do 1000 k11=1,3
      do 1000 k12=1,3
      do 1000 k13=1,3
      do 1000 k14=1,3
      do 1000 k15=1,3
      do 1000 k16=1,3
      do 1000 k17=1,3
      do 1000 k18=1,3
      do 1000 k19=1,3
      do 1000 k20=1,3
c
      nbase(1)=k1
      nbase(2)=k2
      nbase(3)=k3
      nbase(4)=k4
      nbase(5)=k5
      nbase(6)=k6
      nbase(7)=k7
      nbase(8)=k8
      nbase(9)=k9
      nbase(10)=k10
      nbase(11)=k11
      nbase(12)=k12
      nbase(13)=k13
      nbase(14)=k14
      nbase(15)=k15
      nbase(16)=k16
      nbase(17)=k17
      nbase(18)=k18
      nbase(19)=k19
      nbase(20)=k20
c
      do 1250 nn=1,jj
c
      n=0
      do 1200 j=1,nsub
      if(mset(nn,j).eq.1 .and. nbase(j).ne.1 .or.
     1   mset(nn,j).eq.2 .and. nbase(j).ne.2 .or.
     2   mset(nn,j).eq.3 .and. nbase(j).ne.3 .or.
     3   mset(nn,j).eq.4 .and. nbase(j).ne.4) then
         n=n+1
      endif
1200  continue
c
c
      if(n.lt.ndiff) then
         goto 1000
      endif
1250  continue
c
      jj=jj+1
      write(*,130)(nbase(i),i=1,nsub),jj
      do 1100 i=1,nsub
      mset(jj,i)=nbase(i)
1100  continue
c
1000  continue
c
      write(*,*)
130   format(10x,20(1x,i1),5x,i5)
      write(*,*)
      write(*,120) jj
120   format(1x,'Number of words=',i5)
c
c
      end
SEQUENCE LISTING

GENERAL INFORMATION: 

(i) APPLICANT: Sydney Brenner 

(ii) TITLE OF INVENTION: Simultaneous Sequencing of Tagged
Polynucleotides 

(iii) NUMBER OF SEQUENCES: 27 

(iv) CORRESPONDENCE ADDRESS: 
(A) ADDRESSEE: Stephen C. Macevicz, Lynx Therapeutics, Inc.

(B) STREET: 3832 Bay Center Place

(C) CITY: Hayward 

(D) STATE: California 

(E) COUNTRY: USA 

(F) ZIP: 94545 



(v) COMPUTER READABLE FORM: 

(A) MEDIUM TYPE: 3.5 inch diskette 

(B) COMPUTER: IBM compatible 

(C) OPERATING SYSTEM: Windows 3.1 

(D) SOFTWARE: Microsoft Word ver. 5.1



(vi) CURRENT APPLICATION DATA: 

(A) APPLICATION NUMBER: 

(B) FILING DATE: 

(C) CLASSIFICATION: 



(vii) PRIOR APPLICATION DATA: 

(A) APPLICATION NUMBER: 08/560,313 

(B) FILING DATE: 17-NOV-95 



(vii) PRIOR APPLICATION DATA: 

(A) APPLICATION NUMBER: 08/611,155 

(B) FILING DATE: 05-MAR-96 

(viii) ATTORNEY/AGENT INFORMATION: 

(A) NAME: Stephen C. Macevicz 

(B) REGISTRATION NUMBER: 30,285 

(C) REFERENCE/DOCKET NUMBER: 807wo 



(ix) TELECOMMUNICATION INFORMATION:

(A) TELEPHONE: (510) 670-9365 

(B) TELEFAX: (510) 670-9302 



(2) INFORMATION FOR SEQ ID NO: 1:



(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 22 nucleotides 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 
(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 1:



GGAAGAGGAA GAGGAAGAYY YN 

(2) INFORMATION FOR SEQ ID NO: 2: 



(i) SEQUENCE CHARACTERISTICS:

(A) LENGTH: 22 nucleotides 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 2:

GAAGAGGAAG AGGAAGAGYY YN 

(2) INFORMATION FOR SEQ ID NO: 3: 



(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 22 nucleotides 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 3:

AAGAGGAAGA GGAAGAGGYY YN 
(2) INFORMATION FOR SEQ ID NO: 4: 



(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 22 nucleotides 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 4:

AGAGGAAGAG GAAGAGGAYY YN 
(2) INFORMATION FOR SEQ ID NO: 5: 



(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 22 nucleotides 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 5:



GAGGAAGAGG AAGAGGAAYY YN 
(2) INFORMATION FOR SEQ ID NO: 6:



(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 22 nucleotides 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 
(D) TOPOLOGY: linear



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 6:



AGGAAGAGGA AGAGGAAGYY YN 



(2) INFORMATION FOR SEQ ID NO: 7: 



(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 19 nucleotides

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: double 

(D) TOPOLOGY: linear 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 7:



AAAGAAAGGA AGGGCAGCT 



(2) INFORMATION FOR SEQ ID NO: 8: 



(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 16 nucleotides 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: double 

(D) TOPOLOGY: linear 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 8:



NNGGATGNNN NNNNNN 



(2) INFORMATION FOR SEQ ID NO: 9: 



(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 11 nucleotides 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: double
(D) TOPOLOGY: linear 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 9:



ACCAGCTGAT C 



(2) INFORMATION FOR SEQ ID NO: 10: 
(i) SEQUENCE CHARACTERISTICS:

(A) LENGTH: 11 nucleotides 

(B) TYPE: nucleic acid 
(C) STRANDEDNESS: double
(D) TOPOLOGY: linear 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 10:



NRRGATCYNN N 



(2) INFORMATION FOR SEQ ID NO: 11:



(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 48 nucleotides 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 11:



GAGGATGCCT TTATGGATCC ACTCGAGATC CCAATCCA 



(2) INFORMATION FOR SEQ ID NO: 12: 



(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 20 nucleotides 

(B) TYPE: nucleic acid

(C) STRANDEDNESS: double 

(D) TOPOLOGY: linear 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 12:



AGTGGCTGGG CATCGGACCG 



(2) INFORMATION FOR SEQ ID NO: 13:



(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 20 nucleotides 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: double 

(D) TOPOLOGY: linear



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 13:



GGGGCCCAGT CAGCGTCGAT 



(2) INFORMATION FOR SEQ ID NO: 14: 



(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 62 nucleotides 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: double 

(D) TOPOLOGY: linear 
(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 14:

AAAAGGAGGA GGCCTTGATA GAGAGGACCT GTTTAAACGG ATCCGCTGCT 50

TTTTCCTCCT CC 62



(2) INFORMATION FOR SEQ ID NO: 15: 



(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 26 nucleotides 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 15:

NNNNNNNNAA AAGGAGGAGG CCTTGA 26



(2) INFORMATION FOR SEQ ID NO: 16:



(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 26 nucleotides 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: double 

(D) TOPOLOGY: linear 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 16: 



NNNNNNNGGA GGAGGAAAAA GCAGCN 26 



(2) INFORMATION FOR SEQ ID NO: 17: 



(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 26 nucleotides 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: double 

(D) TOPOLOGY: linear 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 17: 



NNNNNNGGAG GAGGAAAAAG CAGCNN 26

(2) INFORMATION FOR SEQ ID NO: 18:



(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 26 nucleotides 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: double 

(D) TOPOLOGY: linear 
(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 18:

NNNNNGGAGG AGGAAAAAGC AGCNNN 26



(2) INFORMATION FOR SEQ ID NO: 19: 



(i) SEQUENCE CHARACTERISTICS:

(A) LENGTH: 26 nucleotides 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: double 

(D) TOPOLOGY: linear 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 19: 



NNNNGGAGGA GGAAAAAGCA GCNNNN 26



(2) INFORMATION FOR SEQ ID NO: 20: 



(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 26 nucleotides 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: double 

(D) TOPOLOGY: linear 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 20: 



NNNGGAGGAG GAAAAAGCAG CNNNNN 26



(2) INFORMATION FOR SEQ ID NO: 21: 

(i) SEQUENCE CHARACTERISTICS:

(A) LENGTH: 26 nucleotides

(B) TYPE: nucleic acid

(C) STRANDEDNESS: double 

(D) TOPOLOGY: linear 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 21: 

NNGGAGGAGG AAAAAGCAGC NNNNNN 26

(2) INFORMATION FOR SEQ ID NO: 22: 



(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 26 nucleotides 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: double 

(D) TOPOLOGY: linear 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 22: 






NGGAGGAGGA AAAAGCAGCN NNNNNN 26



(2) INFORMATION FOR SEQ ID NO: 23: 



(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 26 nucleotides 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: double 

(D) TOPOLOGY: linear 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 23:

GGAGGAGGAA AAAGCAGCNN NNNNNN 26

(2) INFORMATION FOR SEQ ID NO: 24:



(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 43 nucleotides 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: double

(D) TOPOLOGY: linear 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 24: 



NNNNNNNNAG AGAGAGAGAG GAGAGAGAGA GTAGAGAGGA CCG 43



(2) INFORMATION FOR SEQ ID NO: 25: 



(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 12 nucleotides 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: double 

(D) TOPOLOGY: linear 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 25: 



CGGUCCUCUC UA 12 



(2) INFORMATION FOR SEQ ID NO: 26: 



(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 18 nucleotides 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: double 

(D) TOPOLOGY: linear 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 26: 



AAAAGGAGGA GGCCTTGA 18 
(2) INFORMATION FOR SEQ ID NO: 27:

(i) SEQUENCE CHARACTERISTICS:

(A) LENGTH: 18 nucleotides 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: double 

(D) TOPOLOGY: linear 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 27:
GGAGGAGGAA AAAGCAGC 18 
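
Several of the listed sequences contain the nucleotide ambiguity symbols Y (C or T), R (A or G) and N (any base), so one entry stands for a family of concrete sequences. The following Python sketch is an illustration only and is not part of the listing: it counts how many concrete sequences an entry represents and tests whether a concrete sequence is one of them. The names `IUPAC`, `degeneracy`, and `matches` are hypothetical, and only the symbols that occur in the listing are included.

```python
# Bases each symbol may stand for (only symbols used in the listing).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "AG", "Y": "CT", "N": "ACGT",
}


def _clean(seq):
    """Drop the spaces that group bases in tens and normalise case."""
    return seq.replace(" ", "").upper()


def degeneracy(seq):
    """Number of concrete sequences represented by a degenerate entry."""
    count = 1
    for symbol in _clean(seq):
        count *= len(IUPAC[symbol])
    return count


def matches(pattern, seq):
    """True if the concrete sequence `seq` is one instance of `pattern`."""
    pattern, seq = _clean(pattern), _clean(seq)
    return len(pattern) == len(seq) and all(
        base in IUPAC[symbol] for symbol, base in zip(pattern, seq)
    )


if __name__ == "__main__":
    print(degeneracy("GGAAGAGGAA GAGGAAGAYY YN"))  # SEQ ID NO: 1
    print(matches("NRRGATCYNN N", "AAGGATCTAC G"))  # SEQ ID NO: 10
```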
I claim:

1. A method for simultaneously identifying one or more terminal nucleotides of
polynucleotides in a population of polynucleotides, the method comprising the steps
of:

(a) attaching an oligonucleotide tag from a repertoire of tags to each
polynucleotide of the population to form tag-polynucleotide conjugates such that
substantially all different polynucleotides have different oligonucleotide tags attached,
the oligonucleotide tag being selected from the same minimally cross-hybridizing set;

(b) providing a label for each oligonucleotide tag, the label identifying one or
more terminal nucleotides of the polynucleotide to which an oligonucleotide tag is
conjugated;

(c) transferring the oligonucleotide tags or copies thereof from the
tag-polynucleotide conjugates to a spatially addressable array of tag complements so that
the oligonucleotide tags or copies thereof specifically hybridize to their respective tag
complements;

(d) detecting the labels of the oligonucleotide tags or copies thereof on the
spatially addressable array for the identification of the one or more terminal
nucleotides of the polynucleotides in the population.


2. The method of claim 1 further including the steps of (e) shortening said
polynucleotides; and (f) repeating said steps (b) through (e).

3. The method of claim 2 wherein said step of providing said label includes
amplifying said tag-polynucleotide conjugates by a polymerase chain reaction using a
first primer and a second primer, the second primer having a defined 3' terminal
nucleotide and forming a duplex with a primer binding site and one or more terminal
nucleotides at one end of the tag-polynucleotide conjugate, such that a
tag-polynucleotide conjugate is amplified only if the defined 3' terminal nucleotide
basepairs with the one or more nucleotides at the end of the tag-polynucleotide
conjugate.

4. The method of claim 3 wherein said step of shortening includes providing a
nuclease and a nuclease recognition site in said duplex formed between said second
primer and said primer binding site, the nuclease having a recognition site separate
from its cleavage site, the recognition site being positioned to permit cleavage of said
tag-polynucleotide conjugate, whereby said one or more nucleotides at said end of
said tag-polynucleotide conjugate are removed.
5. The method of claim 4 wherein said step of shortening further includes
ligating an adaptor to said end of said tag-polynucleotide conjugate after said cleavage
by said nuclease, the adaptor containing said primer binding site and said nuclease
recognition site.

6. The method of claim 5 wherein said oligonucleotide tag is single stranded and
consists of a plurality of subunits, each subunit consisting of an oligonucleotide of 3
to 9 nucleotides in length and each subunit being selected from the same minimally
cross-hybridizing set.

7. The method of claim 6 wherein said repertoire of said oligonucleotide tags
contains at least 100 of said oligonucleotide tags.

8. The method of claim 7 wherein said subunits of said oligonucleotide tags are
oligonucleotides each having a length between 4 and 9 nucleotides, and wherein each
of said oligonucleotide tag differs from every other oligonucleotide tag of said same
minimally cross-hybridizing set by at least three nucleotides.

9. The method of claim 2 wherein said step of providing said label includes (i) providing
a set of primers, each primer of the set having a terminal nucleotide, a template
positioning segment, and an extension region comprising one or more
complexity-reducing nucleotides or complements thereof; (ii) forming a plurality of templates
comprising a primer binding site and said tag-polynucleotide conjugates, the primer
binding site being complementary to at least one primer of the set; (iii) forming
amplicons from the templates by amplifying double stranded DNAs selectively
formed by extending a primer from the set whose extension region forms a perfectly
matched duplex with the primer binding site of the template; and (iv) labeling said
tags in accordance with the terminal nucleotide of the primer used in said extension
and amplification.

10. The method of claim 9 wherein said step of shortening said polynucleotides is
carried out by mutating said primer binding site of said template by extending and
amplifying each of said double stranded DNAs with said primer whose said template
positioning segment contains a nucleotide mismatched with its adjacent nucleotide in
said primer binding site of said double stranded DNAs so that the identity of the
adjacent nucleotide is changed in said amplicons by oligonucleotide-directed
mutagenesis using said primer.
11. The method of claim 10 wherein said complexity-reducing nucleotide is
deoxyinosine and wherein said polymerase chain reaction and said extending to form
said double stranded DNA are carried out in the presence of deoxyadenosine
triphosphate, deoxycytidine triphosphate, deoxyinosine triphosphate, and thymidine
triphosphate.

12. A method for simultaneously determining the nucleotide sequences of a
population of polynucleotides, the method comprising the steps of:

(a) attaching an oligonucleotide tag from a repertoire of tags to each
polynucleotide of the population to form tag-polynucleotide conjugates such that
substantially all different polynucleotides have different oligonucleotide tags attached,
the oligonucleotide tags being selected from the same minimally cross-hybridizing
set;

(b) amplifying the tag-polynucleotide conjugates by a polymerase chain
reaction using a first primer and a second primer, the second primer having a defined
3' terminal nucleotide and forming a duplex with a primer binding site and one or
more nucleotides at one end of the tag-polynucleotide conjugate, such that a
tag-polynucleotide conjugate is amplified only if the defined 3' terminal nucleotide
basepairs with a nucleotide of the tag-polynucleotide conjugate;

(c) labeling the oligonucleotide tags according to the identity of the defined 3'
terminal nucleotide of the second primer;

(d) copying the oligonucleotide tags from the tag-polynucleotide conjugates;
and

(e) sorting the labeled oligonucleotide tags onto a spatially addressable array
of tag complements for detection.

13. The method of claim 12 further including the steps of (f) shortening said
polynucleotides; and (g) repeating said steps (b) through (f).
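
The cycle recited in claims 12 and 13 can be pictured with a toy model. The Python sketch below is an illustration under simplifying assumptions and does not describe the laboratory procedure: each conjugate is reduced to a tag name and a template string whose first character is the terminal nucleotide next to the primer binding site, selective amplification is modelled as a base-pairing test against the primer's defined 3' terminal nucleotide, and shortening removes exactly one nucleotide per cycle. The names `COMPLEMENT`, `one_cycle`, and `read_out` are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def one_cycle(conjugates):
    """Label each tag with the primer 3' base that pairs with its template.

    `conjugates` maps a tag name to the remaining template string. Four
    reactions are modelled, one per defined 3' terminal nucleotide; a
    conjugate is "amplified" (and its tag labelled) only in the reaction
    whose primer base pairs with the template's terminal nucleotide.
    """
    labels = {}
    for primer_base in "ACGT":
        for tag, template in conjugates.items():
            if template and COMPLEMENT[primer_base] == template[0]:
                labels[tag] = primer_base
    return labels


def read_out(conjugates, cycles):
    """Repeat label, detect and shorten; return the bases read per tag."""
    reads = {tag: "" for tag in conjugates}
    current = dict(conjugates)
    for _ in range(cycles):
        # Detection: the label at each tag's array address names the
        # primer base, so the template base is its complement.
        for tag, primer_base in one_cycle(current).items():
            reads[tag] += COMPLEMENT[primer_base]
        # Shortening: remove the nucleotide just identified.
        current = {tag: seq[1:] for tag, seq in current.items()}
    return reads


if __name__ == "__main__":
    print(read_out({"tag1": "GATTC", "tag2": "CCA"}, cycles=3))
```

In this model the dictionary keyed by tag name plays the role of the spatially addressable array: the tag, not the polynucleotide, determines where each label is read.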


14. The method of claim 13 wherein said step of shortening includes providing a
nuclease and a nuclease recognition site in said duplex formed between said second
primer and said primer binding site, the nuclease having a recognition site separate
from its cleavage site, the recognition site being positioned to permit cleavage of said
tag-polynucleotide conjugate, whereby said one or more nucleotides at said end of
said tag-polynucleotide conjugate are removed.
15. The method of claim 14 wherein said step of shortening further includes
ligating an adaptor to said end of said tag-polynucleotide conjugate after said cleavage
by said nuclease, the adaptor containing said primer binding site and said nuclease
recognition site.

16. The method of claim 15 wherein said oligonucleotide tag is single stranded
and consists of a plurality of subunits, each subunit consisting of an oligonucleotide of
3 to 9 nucleotides in length and each subunit being selected from the same minimally
cross-hybridizing set.

17. The method of claim 16 wherein said repertoire of said oligonucleotide tags
contains at least 100 of said oligonucleotide tags.

18. The method of claim 17 wherein said subunits of said oligonucleotide tags are
oligonucleotides each having a length between 4 and 9 nucleotides, and wherein each
of said oligonucleotide tag differs from every other oligonucleotide tag of said same
minimally cross-hybridizing set by at least three nucleotides.

19. The method of claim 12 wherein said second primer comprises a set of
primers, each primer of the set having a defined 3' terminal nucleotide, a template
positioning segment, and an extension region comprising one or more
complexity-reducing nucleotides or complements thereof; and wherein said step of amplifying
further includes:

forming a plurality of templates comprising said primer binding sites and said
tag-polynucleotide conjugates, the primer binding sites being complementary to at
least one primer of the set; and

forming amplicons from the templates by amplifying double stranded DNAs
selectively formed by extending a primer from the set whose extension region forms a
perfectly matched duplex with said primer binding site of the template.

20. The method of claim 19 wherein said step of shortening said polynucleotides is
carried out by mutating said primer binding site of said template by extending and
amplifying each of said double stranded DNAs with said primer whose said template
positioning segment contains a nucleotide mismatched with its adjacent nucleotide in
said primer binding site of said double stranded DNAs so that the identity of the
adjacent nucleotide is changed in said amplicons by oligonucleotide-directed
mutagenesis using said primer.
21. The method of claim 20 wherein said complexity-reducing nucleotide is
deoxyinosine and wherein said polymerase chain reaction and said extending to form
said double stranded DNA are carried out in the presence of deoxyadenosine
triphosphate, deoxycytidine triphosphate, deoxyinosine triphosphate, and thymidine
triphosphate.

22. A method of identifying a population of mRNA molecules, the method
comprising the steps of:

(a) forming a population of cDNA molecules from the population of mRNA
molecules such that each cDNA molecule has an oligonucleotide tag attached, the
oligonucleotide tags being selected from the same minimally cross-hybridizing set;

(b) removing a sample of cDNA molecules from the population such that
substantially all different cDNA molecules in the sample have different
oligonucleotide tags attached;

(c) providing a label for each oligonucleotide tag in the sample, the label
identifying one or more terminal nucleotides of the cDNA molecule to which the
oligonucleotide tag is attached;

(d) transferring the oligonucleotide tags or copies thereof from the cDNA
molecules to a spatially addressable array of tag complements so that the
oligonucleotide tags or copies thereof specifically hybridize to their respective tag
complements;

(e) detecting the labels of the oligonucleotide tags or copies thereof on the
spatially addressable array for the identification of the one or more terminal
nucleotides of the cDNA molecules in the population;

(f) shortening the cDNA molecules by removing the identified one or more
terminal nucleotides;

(g) repeating steps (c) through (f) until a portion of the sequence of each
cDNA molecule is determined; and

(h) identifying the population of mRNA molecules by a frequency distribution
of the portions of sequences of the cDNA molecules.
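
Step (h) of claim 22 characterizes the mRNA population by how often each partial sequence is observed. As a minimal illustration, assuming the partial sequences have already been read out as strings, such a frequency distribution could be tabulated as in the Python sketch below; the function name `expression_profile` and the example reads are hypothetical.

```python
from collections import Counter


def expression_profile(partial_sequences):
    """Return the relative frequency of each distinct partial sequence."""
    counts = Counter(partial_sequences)
    total = sum(counts.values())
    return {seq: n / total for seq, n in counts.items()}


if __name__ == "__main__":
    reads = ["GATC", "GATC", "TTAG", "GATC"]  # made-up example reads
    for seq, freq in sorted(expression_profile(reads).items()):
        print(seq, freq)
```

Two populations could then be compared by comparing the frequencies reported for the same partial sequence.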

23. The method of claim 22 wherein said step of providing said label includes:

amplifying said cDNA molecules by a polymerase chain reaction using a first
primer and a second primer, the second primer having a defined 3' terminal nucleotide
and forming a duplex with a primer binding site and one or more nucleotides at one
end of each of said cDNA molecules, such that a cDNA molecule is amplified only if
the defined 3' terminal nucleotide basepairs with a nucleotide of the cDNA molecule;
and

labeling said oligonucleotide tags according to the identity of the defined 3'
terminal nucleotide of the second primer.

24. The method of claim 23 wherein said second primer comprises a set of
primers, each primer of the set having a defined 3' terminal nucleotide, a template
positioning segment, and an extension region comprising one or more
complexity-reducing nucleotides or complements thereof; and wherein said step of amplifying
further includes:

forming a plurality of templates comprising said primer binding sites and said
cDNA molecules, the primer binding sites being complementary to at least one primer
of the set; and

forming amplicons from the templates by amplifying double stranded DNAs
selectively formed by extending a primer from the set whose extension region forms a
perfectly matched duplex with said primer binding site of the template.

25. The method of claim 24 wherein said step of shortening said polynucleotides is
carried out by mutating said primer binding site of said template by extending and
amplifying each of said double stranded DNAs with said primer whose said template
positioning segment contains a nucleotide mismatched with its adjacent nucleotide in
said primer binding site of said double stranded DNAs so that the identity of the
adjacent nucleotide is changed in said amplicons by oligonucleotide-directed
mutagenesis using said primer.

26. The method of claim 23 wherein said step of shortening includes providing a
nuclease and a nuclease recognition site in said duplex formed between said second
primer and said primer binding site, the nuclease having a recognition site separate
from its cleavage site, and the recognition site being positioned to permit cleavage of
said cDNA molecules, whereby said one or more nucleotides at said end of each of
said cDNA molecules are removed.



27. The method of claim 26 wherein said step of shortening further includes
ligating an adaptor to said end of said cDNA molecule after said cleavage by said
nuclease, the adaptor containing said primer binding site and said nuclease
recognition site.